Update EKS Helm Chart for Multi-Container Docker Compose Architecture #48

@aarora79

Problem

The current Helm chart for deploying the MCP Gateway & Registry on Amazon EKS was designed for a single-container deployment that ran all services within one pod. However, the project has evolved to use Docker Compose with multiple specialized containers for better scalability, maintainability, and separation of concerns.

Current EKS Single-Container Approach

The existing Helm chart at aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow deploys:

  • Single Pod: All services running in one container
  • Monolithic Setup: Registry, nginx, SSL, and all MCP servers in one deployment
  • Single Entrypoint: Uses /app/docker/entrypoint.sh to orchestrate all services
  • Resource Requirements: GPU-enabled node (g5.2xlarge) with 400Gi EBS storage
  • Ports: 80 (HTTP), 443 (HTTPS), 7860 (Registry)
  • Health Checks: Single readiness/liveness probe on /login endpoint

New Multi-Container Docker Compose Architecture

The project now uses Docker Compose with the following separate services:

Core Services

  • Registry Service (registry): Main registry with nginx reverse proxy, SSL, FAISS, models, and web UI

    • Ports: 80, 443, 7860
    • Dockerfile: docker/Dockerfile.registry
    • Entrypoint: docker/registry-entrypoint.sh
    • Dependencies: auth-server
  • Auth Server (auth-server): Separate authentication service with Amazon Cognito and GitHub OAuth

    • Port: 8888
    • Dockerfile: docker/Dockerfile.auth
    • Scalable and independent service

MCP Server Services (Each as Separate Containers)

  • Current Time Server (currenttime-server): Port 8000
  • Financial Info Server (fininfo-server): Port 8001 (requires POLYGON_API_KEY)
  • MCP Gateway Server (mcpgw-server): Port 8003 (depends on registry)
  • Real Server Fake Tools (realserverfaketools-server): Port 8002

All MCP servers use the same base Dockerfile (docker/Dockerfile.mcp-server) with different SERVER_PATH build args.
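As a rough illustration of the shared-Dockerfile pattern, a Compose fragment for two of the servers might look like the sketch below. The `SERVER_PATH` values and the build context are assumptions made for illustration; the actual values live in the project's compose file.

```yaml
# Sketch only: two MCP servers built from the same Dockerfile.
# SERVER_PATH values and the build context are hypothetical.
services:
  currenttime-server:
    build:
      context: .
      dockerfile: docker/Dockerfile.mcp-server
      args:
        SERVER_PATH: servers/currenttime   # hypothetical path
    ports:
      - "8000:8000"

  fininfo-server:
    build:
      context: .
      dockerfile: docker/Dockerfile.mcp-server
      args:
        SERVER_PATH: servers/fininfo       # hypothetical path
    environment:
      - POLYGON_API_KEY=${POLYGON_API_KEY} # read from the host environment
    ports:
      - "8001:8001"
```

In the Helm chart, each of these would map to its own image tag (or one image per `SERVER_PATH`) referenced from a separate Deployment.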

Key Differences: Single Container vs Multi-Container

| Aspect | Current EKS (Single Container) | New Docker Compose (Multi-Container) |
|---|---|---|
| Architecture | Monolithic - all services in one pod | Microservices - each service in separate container |
| Scaling | Scale entire application together | Scale individual services independently |
| Resource Usage | GPU required for entire deployment | GPU only needed for specific services (if any) |
| Fault Tolerance | Single point of failure | Service isolation - one service failure doesn't affect others |
| Development | Complex debugging and development | Easier to develop, test, and debug individual services |
| Authentication | Embedded in main service | Dedicated auth service that can be scaled independently |

Required Helm Chart Updates

1. Service Decomposition

Transform from single deployment to multiple Kubernetes deployments:

  • Registry Deployment: Main registry with nginx reverse proxy
  • Auth Server Deployment: Authentication service
  • MCP Server Deployments: Individual deployments for each MCP server (currenttime, fininfo, mcpgw, realserverfaketools)
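A minimal sketch of what one decomposed service could look like, using the auth server as the example. The image reference and label names are placeholders, not values from the existing chart; only the service name and port 8888 come from the Compose setup described above.

```yaml
# Sketch: one Deployment plus one ClusterIP Service per Compose service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-server
  labels:
    app: auth-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: auth-server
  template:
    metadata:
      labels:
        app: auth-server
    spec:
      containers:
        - name: auth-server
          image: registry.example.com/mcp-gateway/auth-server:latest  # hypothetical image
          ports:
            - containerPort: 8888
---
apiVersion: v1
kind: Service
metadata:
  name: auth-server
spec:
  selector:
    app: auth-server
  ports:
    - port: 8888
      targetPort: 8888
```

The same Deployment/Service pair would be repeated (or templated in a loop) for the registry and each MCP server.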

2. Inter-Service Communication

  • Services communicate via Kubernetes service discovery
  • Registry depends on auth-server (startup order enforced with init containers or readiness probes)
  • MCP Gateway server depends on registry
  • Internal service URLs (for example AUTH_SERVER_URL and REGISTRY_URL) need to point at the Kubernetes Service DNS names
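One possible way to express the registry's dependency on auth-server is an init container that blocks until the auth port accepts connections. The fragment below is a sketch of the registry pod template only; it assumes the Service is named `auth-server`, that the chosen image ships an `nc` that supports `-z` (swap in any equivalent TCP check otherwise), and the registry image reference is a placeholder.

```yaml
# Fragment of the registry Deployment (spec.template.spec), not a full manifest.
initContainers:
  - name: wait-for-auth-server
    image: busybox:1.36              # assumption: this image's nc supports -z
    command:
      - sh
      - -c
      - |
        until nc -z auth-server 8888; do
          echo "waiting for auth-server"
          sleep 2
        done
containers:
  - name: registry
    image: registry.example.com/mcp-gateway/registry:latest  # hypothetical image
    env:
      - name: AUTH_SERVER_URL
        value: http://auth-server:8888   # in-cluster Service DNS name
```

Within the same namespace the short name `auth-server` resolves through cluster DNS; across namespaces the fully qualified form `auth-server.<namespace>.svc.cluster.local` would be needed.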

3. Environment Variables & Configuration

The current single-container approach passes plain environment variables to one container. The multi-container approach needs:

  • ConfigMaps: Non-sensitive config (COGNITO_USER_POOL_ID, AWS_REGION, service URLs)
  • Secrets: Sensitive data (ADMIN_PASSWORD, SECRET_KEY, COGNITO_CLIENT_SECRET, GITHUB_CLIENT_SECRET, POLYGON_API_KEY)
  • Service Discovery: Internal URLs for inter-service communication
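A sketch of how this split could look as Kubernetes objects. The object names (`mcp-gateway-config`, `mcp-gateway-secrets`) and all placeholder values are hypothetical; in a real chart the Secret values would be supplied at install time rather than committed.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-gateway-config            # hypothetical name
data:
  AWS_REGION: us-east-1
  COGNITO_USER_POOL_ID: "<your-user-pool-id>"
  AUTH_SERVER_URL: http://auth-server:8888
---
apiVersion: v1
kind: Secret
metadata:
  name: mcp-gateway-secrets           # hypothetical name
type: Opaque
stringData:                           # stringData accepts plain text; Kubernetes stores it base64-encoded
  ADMIN_PASSWORD: "<set-at-install-time>"
  SECRET_KEY: "<set-at-install-time>"
  COGNITO_CLIENT_SECRET: "<set-at-install-time>"
  GITHUB_CLIENT_SECRET: "<set-at-install-time>"
  POLYGON_API_KEY: "<set-at-install-time>"
```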

4. Resource Optimization

  • Current: Single g5.2xlarge GPU node for everything
  • New: Right-size resources per service:
    • Registry: CPU-optimized for web UI and nginx
    • Auth Server: Minimal resources for authentication
    • MCP Servers: Lightweight containers for individual tools
    • GPU resources only where actually needed
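Per-service sizing could be exposed through values.yaml along these lines. The key names and every number here are placeholders chosen for illustration, not measured requirements; real values should come from profiling each service.

```yaml
# values.yaml sketch: hypothetical keys and placeholder numbers.
registry:
  resources:
    requests: { cpu: "1", memory: 2Gi }
    limits:   { cpu: "2", memory: 4Gi }

authServer:
  resources:
    requests: { cpu: 100m, memory: 128Mi }
    limits:   { cpu: 500m, memory: 256Mi }

mcpServers:
  currenttime:
    resources:
      requests: { cpu: 50m, memory: 64Mi }
      limits:   { cpu: 250m, memory: 128Mi }
```

With no `nvidia.com/gpu` entry under `limits`, none of these pods would be scheduled onto GPU capacity; a GPU limit would be added only to a service that actually needs one.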

5. Health Checks & Probes

  • Current: Single health check on /login endpoint
  • New: Service-specific health checks:
    • Registry: /health endpoint on port 7860
    • Auth Server: Health check on port 8888
    • MCP Servers: Individual health checks on their respective ports
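The probes for the first two services might be sketched as follows. The registry path and port come from the list above; the auth server's health path is not specified here, so the sketch falls back to a TCP check on port 8888. Timing values are placeholders.

```yaml
# Fragments of two containers[] entries, shown together for comparison.
- name: registry
  readinessProbe:
    httpGet:
      path: /health
      port: 7860
    initialDelaySeconds: 10   # placeholder timing
    periodSeconds: 10
  livenessProbe:
    httpGet:
      path: /health
      port: 7860
    initialDelaySeconds: 30   # placeholder timing
    periodSeconds: 20

- name: auth-server
  readinessProbe:
    tcpSocket:                # TCP check, since no HTTP health path is given
      port: 8888
    periodSeconds: 10
```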

6. Persistent Storage

The current approach uses a single EBS volume. The new approach needs:

  • Shared Storage: For MCP server metadata and models (/opt/mcp-gateway/servers, /opt/mcp-gateway/models)
  • Logs: Centralized logging volume (/var/log/mcp-gateway)
  • SSL Certificates: Shared SSL certificate storage
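Because EBS volumes are attached to one node at a time (ReadWriteOnce), sharing the same data across pods on different nodes generally calls for a ReadWriteMany backend such as Amazon EFS. The sketch below assumes an EFS-backed StorageClass named `efs-sc` exists in the cluster; the claim name, size, and `subPath` layout are also hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mcp-gateway-shared        # hypothetical name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc        # assumption: EFS CSI driver and class are installed
  resources:
    requests:
      storage: 50Gi               # placeholder size
---
# Fragment of a pod template (spec.template.spec) mounting the shared claim.
volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: mcp-gateway-shared
containers:
  - name: registry
    volumeMounts:
      - name: shared
        mountPath: /opt/mcp-gateway/servers
        subPath: servers
      - name: shared
        mountPath: /opt/mcp-gateway/models
        subPath: models
      - name: shared
        mountPath: /var/log/mcp-gateway
        subPath: logs
```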

7. Networking & Ingress

  • Current: Single service with multiple ports
  • New: Multiple services requiring proper ingress configuration:
    • Main registry UI and API
    • Individual MCP server endpoints
    • Authentication service endpoints
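One way to front these services is a single Ingress that routes by host. The hostnames, the `alb` ingress class (which presumes the AWS Load Balancer Controller is installed), and the Service names are all assumptions. If the registry's nginx already proxies the MCP servers, routing only to the registry may be enough.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mcp-gateway
spec:
  ingressClassName: alb            # assumption: AWS Load Balancer Controller
  rules:
    - host: mcp.example.com        # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: registry
                port:
                  number: 80
    - host: auth.example.com       # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: auth-server
                port:
                  number: 8888
```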

Proposed Migration Strategy

Phase 1: Maintain Compatibility

  • Create new multi-container Helm chart alongside existing single-container chart
  • Allow users to choose deployment method via values.yaml flag
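The values.yaml switch could be as small as the sketch below. The key name `deploymentMode` and its two values are hypothetical.

```yaml
# values.yaml sketch: hypothetical switch between the two chart modes.
deploymentMode: multi-container    # or: single-container
```

Templates would then wrap each set of manifests in a conditional such as `{{- if eq .Values.deploymentMode "multi-container" }}`, so that only one set is rendered per release.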

Phase 2: Multi-Container Implementation

  • Implement separate deployments for each service
  • Configure proper service discovery and dependencies
  • Optimize resource allocation per service

Phase 3: Deprecation

  • Mark single-container approach as deprecated
  • Provide migration guide for existing deployments

Environment Variables Mapping

Registry Service

  • SECRET_KEY: Secret (auto-generated if not provided)
  • ADMIN_USER: ConfigMap (default: admin)
  • ADMIN_PASSWORD: Secret
  • AUTH_SERVER_URL: ConfigMap (internal service URL)
  • AUTH_SERVER_EXTERNAL_URL: ConfigMap
  • COGNITO_USER_POOL_ID: ConfigMap
  • COGNITO_CLIENT_ID: ConfigMap
  • COGNITO_CLIENT_SECRET: Secret
  • AWS_REGION: ConfigMap (default: us-east-1)

Auth Server

  • REGISTRY_URL: ConfigMap (internal service URL)
  • SECRET_KEY: Secret (shared with registry)
  • GITHUB_CLIENT_ID: ConfigMap
  • GITHUB_CLIENT_SECRET: Secret
  • COGNITO_USER_POOL_ID: ConfigMap
  • COGNITO_CLIENT_ID: ConfigMap
  • COGNITO_CLIENT_SECRET: Secret
  • AWS_REGION: ConfigMap

MCP Servers

  • POLYGON_API_KEY: Secret (for fininfo-server)
  • REGISTRY_BASE_URL: ConfigMap (for mcpgw-server)
  • REGISTRY_USERNAME: ConfigMap (for mcpgw-server)
  • REGISTRY_PASSWORD: Secret (for mcpgw-server)
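As an example of how the mapping above translates into a container spec, the mcpgw-server variables could be wired like this. The ConfigMap and Secret names are hypothetical, and each key is assumed to be stored under the same name as the environment variable.

```yaml
# Fragment of the mcpgw-server containers[] entry.
- name: mcpgw-server
  env:
    - name: REGISTRY_BASE_URL
      valueFrom:
        configMapKeyRef:
          name: mcp-gateway-config     # hypothetical ConfigMap
          key: REGISTRY_BASE_URL
    - name: REGISTRY_USERNAME
      valueFrom:
        configMapKeyRef:
          name: mcp-gateway-config
          key: REGISTRY_USERNAME
    - name: REGISTRY_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mcp-gateway-secrets    # hypothetical Secret
          key: REGISTRY_PASSWORD
```

Where a service consumes every key in an object, `envFrom` with `configMapRef` or `secretRef` is a shorter alternative to listing keys one by one.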

Testing Requirements

The updated Helm chart should be tested with:

  1. Fresh EKS cluster deployment with multi-container architecture
  2. Resource optimization - verify services run with appropriate resource allocation
  3. Service dependencies - ensure proper startup order and health checks
  4. SSL certificate configuration across services
  5. Amazon Cognito integration with separate auth service
  6. All MCP server functionality through the gateway
  7. Agent authentication flows (both user identity and agentic identity)
  8. Scaling scenarios - test independent scaling of services
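For the scaling scenario, one option is to attach a HorizontalPodAutoscaler to a single service and confirm the others stay at their original replica counts. The sketch assumes a Deployment named `auth-server` with CPU requests set and a metrics server running in the cluster; the replica bounds and the 70% target are placeholders.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: auth-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: auth-server
  minReplicas: 1
  maxReplicas: 5                   # placeholder bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # placeholder target
```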

References

Acceptance Criteria

  • Multi-service deployment: Helm chart deploys registry, auth-server, and all MCP servers as separate Kubernetes deployments
  • Service dependencies: Proper init containers or readiness probes ensure correct startup order
  • Resource optimization: Services use appropriate resource allocation (no unnecessary GPU requirements)
  • Inter-service communication: Services can communicate via Kubernetes service discovery
  • SSL/TLS termination: Works properly across the multi-service architecture
  • Authentication integration: Amazon Cognito works with separate auth service
  • MCP server accessibility: All MCP servers accessible through the gateway
  • Persistent storage: Shared volumes for logs, models, and server metadata work correctly
  • Health checks: Individual service health checks and monitoring work as expected
  • Scaling capability: Individual services can be scaled independently
  • Migration documentation: Clear guide for migrating from single-container to multi-container deployment
  • Backward compatibility: Option to deploy single-container version during transition period

@ajayvohra2005 Your expertise and help would be greatly appreciated for this EKS Helm chart migration from single-container to multi-container architecture. Thank you!

Labels: enhancement (New feature or request)