Update EKS Helm Chart for Multi-Container Docker Compose Architecture
Problem
The current Helm chart for deploying the MCP Gateway & Registry on Amazon EKS was designed for a single-container deployment that ran all services within one pod. However, the project has evolved to use Docker Compose with multiple specialized containers for better scalability, maintainability, and separation of concerns.
Current EKS Single-Container Approach
The existing Helm chart at aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow deploys:
- Single Pod: All services running in one container
- Monolithic Setup: Registry, nginx, SSL, and all MCP servers in one deployment
- Single Entrypoint: Uses /app/docker/entrypoint.sh to orchestrate all services
- Resource Requirements: GPU-enabled node (g5.2xlarge) with 400Gi EBS storage
- Ports: 80 (HTTP), 443 (HTTPS), 7860 (Registry)
- Health Checks: Single readiness/liveness probe on /login endpoint
New Multi-Container Docker Compose Architecture
The project now uses Docker Compose with the following separate services:
Core Services
- Registry Service (registry): Main registry with nginx reverse proxy, SSL, FAISS, models, and web UI
  - Ports: 80, 443, 7860
  - Dockerfile: docker/Dockerfile.registry
  - Entrypoint: docker/registry-entrypoint.sh
  - Dependencies: auth-server
- Auth Server (auth-server): Separate authentication service with Amazon Cognito and GitHub OAuth
  - Port: 8888
  - Dockerfile: docker/Dockerfile.auth
  - Scalable and independent service
MCP Server Services (Each as Separate Containers)
- Current Time Server (currenttime-server): Port 8000
- Financial Info Server (fininfo-server): Port 8001 (requires POLYGON_API_KEY)
- MCP Gateway Server (mcpgw-server): Port 8003 (depends on registry)
- Real Server Fake Tools (realserverfaketools-server): Port 8002
All MCP servers use the same base Dockerfile (docker/Dockerfile.mcp-server) with different SERVER_PATH build args.
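As a rough illustration of the shared-Dockerfile pattern, a Compose fragment along these lines would build two of the MCP servers from the same Dockerfile. The service names, ports, and Dockerfile path come from this issue; the build context and the SERVER_PATH values are placeholders, not copied from the actual docker-compose.yml.

```yaml
# Sketch only: the build context and SERVER_PATH values are hypothetical.
services:
  currenttime-server:
    build:
      context: .
      dockerfile: docker/Dockerfile.mcp-server
      args:
        SERVER_PATH: servers/currenttime   # hypothetical path
    ports:
      - "8000:8000"
  fininfo-server:
    build:
      context: .
      dockerfile: docker/Dockerfile.mcp-server
      args:
        SERVER_PATH: servers/fininfo       # hypothetical path
    environment:
      - POLYGON_API_KEY=${POLYGON_API_KEY}
    ports:
      - "8001:8001"
```

In the Helm chart, each of these services would map to its own image and Deployment, since Kubernetes does not build images at deploy time.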
Key Differences: Single Container vs Multi-Container
| Aspect | Current EKS (Single Container) | New Docker Compose (Multi-Container) |
|---|---|---|
| Architecture | Monolithic - all services in one pod | Microservices - each service in separate container |
| Scaling | Scale entire application together | Scale individual services independently |
| Resource Usage | GPU required for entire deployment | GPU only needed for specific services (if any) |
| Fault Tolerance | Single point of failure | Service isolation - a failure in one service does not directly take down the others (dependent services may still degrade) |
| Development | Complex debugging and development | Easier to develop, test, and debug individual services |
| Authentication | Embedded in main service | Dedicated auth service that can be scaled independently |
Required Helm Chart Updates
1. Service Decomposition
Transform from single deployment to multiple Kubernetes deployments:
- Registry Deployment: Main registry with nginx reverse proxy
- Auth Server Deployment: Authentication service
- MCP Server Deployments: Individual deployments for each MCP server (currenttime, fininfo, mcpgw, realserverfaketools)
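A minimal sketch of what one decomposed service could look like, using the auth server as the example. The port (8888) and the service name come from this issue; the image reference, labels, and replica count are assumptions for illustration only.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-server
  labels:
    app: auth-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: auth-server
  template:
    metadata:
      labels:
        app: auth-server
    spec:
      containers:
        - name: auth-server
          # Hypothetical image reference; replace with the real registry/tag.
          image: example.registry/mcp-gateway/auth-server:latest
          ports:
            - containerPort: 8888
---
apiVersion: v1
kind: Service
metadata:
  name: auth-server
spec:
  selector:
    app: auth-server
  ports:
    - name: http
      port: 8888
      targetPort: 8888
```

The registry and each MCP server would presumably follow the same Deployment-plus-Service pattern with their own ports.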
2. Inter-Service Communication
- Services communicate via Kubernetes service discovery
- Registry depends on auth-server (init containers or readiness probes)
- MCP Gateway server depends on registry
- Internal service URLs need to be configured properly
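One possible way to express the registry's dependency on auth-server is an init container that blocks until the auth Service accepts TCP connections. This is a sketch: it assumes a Service named auth-server on port 8888 in the same namespace, a busybox build whose nc supports -z, and plain HTTP between pods. The registry image reference is hypothetical.

```yaml
# Fragment of the registry Deployment's pod template (spec.template.spec).
initContainers:
  - name: wait-for-auth-server
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        # Poll the auth-server Service until the port is reachable.
        until nc -z auth-server 8888; do
          echo "waiting for auth-server"
          sleep 2
        done
containers:
  - name: registry
    # Hypothetical image reference.
    image: example.registry/mcp-gateway/registry:latest
    env:
      # Internal URL resolved through Kubernetes service discovery (scheme assumed).
      - name: AUTH_SERVER_URL
        value: http://auth-server:8888
```

A readiness probe on the auth server is still useful alongside this, because the init container only gates the registry's first start.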
3. Environment Variables & Configuration
The current single-container approach passes everything as plain environment variables. The multi-container approach needs:
- ConfigMaps: Non-sensitive config (COGNITO_USER_POOL_ID, AWS_REGION, service URLs)
- Secrets: Sensitive data (ADMIN_PASSWORD, SECRET_KEY, COGNITO_CLIENT_SECRET, GITHUB_CLIENT_SECRET, POLYGON_API_KEY)
- Service Discovery: Internal URLs for inter-service communication
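The split above could look roughly like the following. The variable names come from this issue; the object names (mcp-gateway-config, mcp-gateway-secrets) and all values are placeholders.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-gateway-config        # hypothetical name
data:
  AWS_REGION: us-east-1
  COGNITO_USER_POOL_ID: "<your-user-pool-id>"
  # Internal service URL, assuming a Service named auth-server.
  AUTH_SERVER_URL: http://auth-server:8888
---
apiVersion: v1
kind: Secret
metadata:
  name: mcp-gateway-secrets       # hypothetical name
type: Opaque
stringData:
  ADMIN_PASSWORD: "<set-at-install-time>"
  SECRET_KEY: "<set-at-install-time>"
  COGNITO_CLIENT_SECRET: "<set-at-install-time>"
```

Containers can then pull these in with envFrom (configMapRef / secretRef) or per-variable valueFrom references. Real secret values should come from the install-time values or an external secret store rather than being committed to the chart.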
4. Resource Optimization
- Current: Single g5.2xlarge GPU node for everything
- New: Right-size resources per service:
  - Registry: CPU-optimized for web UI and nginx
  - Auth Server: Minimal resources for authentication
  - MCP Servers: Lightweight containers for individual tools
  - GPU resources only where actually needed
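To make the right-sizing idea concrete, a values.yaml fragment might expose per-service resources like this. The key names and every number below are illustrative guesses, not measured requirements.

```yaml
# values.yaml fragment (hypothetical keys and numbers)
registry:
  resources:
    requests: {cpu: "1", memory: 2Gi}
    limits: {cpu: "2", memory: 4Gi}
authServer:
  resources:
    requests: {cpu: 100m, memory: 128Mi}
    limits: {cpu: 500m, memory: 512Mi}
mcpServers:
  resources:
    requests: {cpu: 100m, memory: 128Mi}
    limits: {cpu: 250m, memory: 256Mi}
# No nvidia.com/gpu limits appear here; a GPU would be requested only by
# a service that is shown to need one.
```

Actual values should be derived from observed usage under load.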
5. Health Checks & Probes
- Current: Single health check on /login endpoint
- New: Service-specific health checks:
  - Registry: /health endpoint on port 7860
  - Auth Server: Health check on port 8888
  - MCP Servers: Individual health checks on their respective ports
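A sketch of service-specific probes. The registry path and port (/health on 7860) and the auth server port (8888) come from this issue. Because no health path is specified for the auth server, a TCP check is used here as an assumption. The timing values are placeholders.

```yaml
# Fragment for the registry pod spec
containers:
  - name: registry
    readinessProbe:
      httpGet:
        path: /health
        port: 7860
      initialDelaySeconds: 10
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 7860
      initialDelaySeconds: 30
      periodSeconds: 20
---
# Fragment for the auth-server pod spec
containers:
  - name: auth-server
    readinessProbe:
      # TCP check, since the auth server's health endpoint is not named here.
      tcpSocket:
        port: 8888
      periodSeconds: 10
```

The MCP servers could use the same tcpSocket pattern on ports 8000 to 8003 until dedicated health endpoints are confirmed.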
6. Persistent Storage
The current approach uses a single EBS volume. The new approach needs:
- Shared Storage: For MCP server metadata and models (/opt/mcp-gateway/servers, /opt/mcp-gateway/models)
- Logs: Centralized logging volume (/var/log/mcp-gateway)
- SSL Certificates: Shared SSL certificate storage
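One thing to watch: EBS-backed volumes are normally ReadWriteOnce, so storage shared by pods on different nodes generally needs a ReadWriteMany backend such as EFS. A sketch under that assumption follows. The claim name, the storage class name (efs-sc), the size, and the subPath layout are all hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mcp-gateway-shared     # hypothetical name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc     # hypothetical; assumes the EFS CSI driver is installed
  resources:
    requests:
      storage: 20Gi            # placeholder size
---
# Fragment of a pod spec that mounts the shared claim
volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: mcp-gateway-shared
containers:
  - name: registry
    volumeMounts:
      - name: shared
        mountPath: /opt/mcp-gateway/servers
        subPath: servers
      - name: shared
        mountPath: /opt/mcp-gateway/models
        subPath: models
```

SSL certificates might be better served by a Kubernetes Secret of type kubernetes.io/tls than by a shared volume, but that is a design choice for the chart authors.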
7. Networking & Ingress
- Current: Single service with multiple ports
- New: Multiple services requiring proper ingress configuration:
  - Main registry UI and API
  - Individual MCP server endpoints
  - Authentication service endpoints
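A possible Ingress layout, assuming host-based routing and that the registry's nginx continues to proxy MCP server traffic. The hostnames, the ingress class, and the Service names are assumptions. Only the ports come from this issue.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-gateway
spec:
  ingressClassName: alb          # assumption: AWS Load Balancer Controller
  rules:
    - host: mcp.example.com      # hypothetical host for registry UI and API
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: registry
                port:
                  number: 80
    - host: auth.example.com     # hypothetical host for the auth service
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: auth-server
                port:
                  number: 8888
```

If MCP servers need to be reachable directly rather than through the registry, additional rules pointing at their Services would be added in the same way.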
Proposed Migration Strategy
Phase 1: Maintain Compatibility
- Create new multi-container Helm chart alongside existing single-container chart
- Allow users to choose deployment method via values.yaml flag
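The flag could be as simple as a single top-level value. The key name and its allowed values below are hypothetical.

```yaml
# values.yaml (hypothetical key name and values)
# "single-container" keeps the existing behavior;
# "multi-container" renders one Deployment per service.
deploymentMode: multi-container
```

Templates for the new Deployments could then be wrapped in a conditional such as `{{- if eq .Values.deploymentMode "multi-container" }}`, with the legacy template gated on the opposite value.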
Phase 2: Multi-Container Implementation
- Implement separate deployments for each service
- Configure proper service discovery and dependencies
- Optimize resource allocation per service
Phase 3: Deprecation
- Mark single-container approach as deprecated
- Provide migration guide for existing deployments
Environment Variables Mapping
Registry Service
- SECRET_KEY: Secret (auto-generated if not provided)
- ADMIN_USER: ConfigMap (default: admin)
- ADMIN_PASSWORD: Secret
- AUTH_SERVER_URL: ConfigMap (internal service URL)
- AUTH_SERVER_EXTERNAL_URL: ConfigMap
- COGNITO_USER_POOL_ID: ConfigMap
- COGNITO_CLIENT_ID: ConfigMap
- COGNITO_CLIENT_SECRET: Secret
- AWS_REGION: ConfigMap (default: us-east-1)
Auth Server
- REGISTRY_URL: ConfigMap (internal service URL)
- SECRET_KEY: Secret (shared with registry)
- GITHUB_CLIENT_ID: ConfigMap
- GITHUB_CLIENT_SECRET: Secret
- COGNITO_USER_POOL_ID: ConfigMap
- COGNITO_CLIENT_ID: ConfigMap
- COGNITO_CLIENT_SECRET: Secret
- AWS_REGION: ConfigMap
MCP Servers
- POLYGON_API_KEY: Secret (for fininfo-server)
- REGISTRY_BASE_URL: ConfigMap (for mcpgw-server)
- REGISTRY_USERNAME: ConfigMap (for mcpgw-server)
- REGISTRY_PASSWORD: Secret (for mcpgw-server)
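As an example of how this mapping might be wired into a container, here is a sketch for mcpgw-server. The variable names and port come from this issue. The ConfigMap and Secret names and the image reference are hypothetical.

```yaml
# Fragment of the mcpgw-server pod spec
containers:
  - name: mcpgw-server
    # Hypothetical image reference.
    image: example.registry/mcp-gateway/mcpgw-server:latest
    ports:
      - containerPort: 8003
    env:
      - name: REGISTRY_BASE_URL
        valueFrom:
          configMapKeyRef:
            name: mcp-gateway-config      # hypothetical ConfigMap
            key: REGISTRY_BASE_URL
      - name: REGISTRY_USERNAME
        valueFrom:
          configMapKeyRef:
            name: mcp-gateway-config
            key: REGISTRY_USERNAME
      - name: REGISTRY_PASSWORD
        valueFrom:
          secretKeyRef:
            name: mcp-gateway-secrets     # hypothetical Secret
            key: REGISTRY_PASSWORD
```

Using explicit valueFrom entries instead of envFrom keeps each container limited to the variables it actually needs.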
Testing Requirements
The updated Helm chart should be tested with:
- Fresh EKS cluster deployment with multi-container architecture
- Resource optimization - verify services run with appropriate resource allocation
- Service dependencies - ensure proper startup order and health checks
- SSL certificate configuration across services
- Amazon Cognito integration with separate auth service
- All MCP server functionality through the gateway
- Agent authentication flows (both user identity and agentic identity)
- Scaling scenarios - test independent scaling of services
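For the independent-scaling test, one option is a HorizontalPodAutoscaler on a single service. This sketch targets the auth server and assumes a Deployment named auth-server, a running metrics server, and CPU requests set on the container. The replica bounds and the 70% target are placeholders.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: auth-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: auth-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The test would then confirm that only the auth-server replica count changes under load while the registry and MCP servers stay at their configured sizes.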
References
- Current Single-Container Helm Chart: aws-samples EKS ML repo
- New Multi-Container Docker Compose: docker-compose.yml
- Registry Entrypoint Script: docker/registry-entrypoint.sh
- Authentication Documentation: docs/auth.md
- Cognito Setup: docs/cognito.md
Acceptance Criteria
- Multi-service deployment: Helm chart deploys registry, auth-server, and all MCP servers as separate Kubernetes deployments
- Service dependencies: Proper init containers or readiness probes ensure correct startup order
- Resource optimization: Services use appropriate resource allocation (no unnecessary GPU requirements)
- Inter-service communication: Services can communicate via Kubernetes service discovery
- SSL/TLS termination: Works properly across the multi-service architecture
- Authentication integration: Amazon Cognito works with separate auth service
- MCP server accessibility: All MCP servers accessible through the gateway
- Persistent storage: Shared volumes for logs, models, and server metadata work correctly
- Health checks: Individual service health checks and monitoring work as expected
- Scaling capability: Individual services can be scaled independently
- Migration documentation: Clear guide for migrating from single-container to multi-container deployment
- Backward compatibility: Option to deploy single-container version during transition period
@ajayvohra2005 Your expertise and help would be greatly appreciated for this EKS Helm chart migration from single-container to multi-container architecture. Thank you!