@aarora79
Contributor

feat: Production AWS ECS deployment with improved documentation and security

Summary

This PR implements a complete production-ready AWS ECS Fargate deployment for MCP Gateway Registry with comprehensive documentation improvements and enhanced security features.

Major Features

☁️ AWS ECS Production Deployment

  • Multi-AZ Architecture: High availability across 3 availability zones
  • Application Load Balancer: HTTPS/SSL termination with ACM certificates
  • Auto-scaling: Dynamic scaling based on CPU and memory utilization
  • CloudWatch Integration: Comprehensive monitoring, logging, and alerting
  • NAT Gateway HA: High-availability NAT gateway configuration
  • Keycloak Integration: Enterprise authentication with RDS Aurora PostgreSQL backend
  • EFS Shared Storage: Persistent storage for models, logs, and configuration
  • Service Discovery: AWS Cloud Map for service-to-service communication

📚 Documentation Improvements

  • Complete ECS deployment guide with step-by-step regional deployment instructions
  • Time estimates for all deployment stages (~30-40 minutes total)
  • Accurate container image sizes validated against ECR (~9.8GB total across 7 images)
  • Clear distinction between regional and custom domain configurations
  • Comprehensive troubleshooting and operations guides

🔒 Security Enhancements

  • Mandatory Password Requirements: INITIAL_ADMIN_PASSWORD now required as environment variable (no defaults)
  • Password Distinction Documentation: Clear separation between Keycloak master admin and realm admin passwords
  • Region-Agnostic Configuration: Generic placeholders (YOUR_AWS_REGION, YOUR_ACCOUNT_ID) to prevent hardcoded values
  • Network Access Control: Comprehensive options for IP allowlisting with security best practices

🛠️ Build System Improvements

  • Unified Build Configuration: build-config.yaml as single source of truth
  • Regional Support: Automatic ECR repository creation and region-specific builds
  • Consolidated Scripts: Replaced multiple shell scripts with streamlined build system
  • Image Manifest Generation: Automated tracking of container images and sizes

🤖 A2A Agent Deployment

  • Flight booking and travel assistant agents integrated into ECS deployment
  • Proper Docker build contexts and multi-stage builds
  • Registry client improvements with group API fixes

Files Changed

New Files (48)

AWS ECS Terraform Infrastructure:

  • terraform/aws-ecs/ - Complete production deployment configuration
    • Main infrastructure files: main.tf, ecs.tf, vpc.tf, variables.tf, outputs.tf
    • Keycloak resources: keycloak-*.tf (ALB, database, DNS, ECR, ECS, security groups)
    • MCP Gateway module: modules/mcp-gateway/ with networking, storage, monitoring, IAM
    • Comprehensive README.md with deployment guide
    • Example configuration: terraform.tfvars.example

Scripts and Automation:

  • terraform/aws-ecs/scripts/ - Management and initialization scripts
    • init-keycloak.sh - Keycloak realm and user setup
    • service_mgmt.sh, user_mgmt.sh - Operations tooling
    • view-cloudwatch-logs.sh - Log monitoring
  • scripts/build-images.sh - Unified container build system
  • scripts/generate-image-manifest.sh - Image size tracking

API and Examples:

  • api/ - Standalone API client and management tools
  • cli/examples/ - Additional MCP server and agent examples

Documentation:

  • docs/api-specs/ - OpenAPI specifications for A2A, auth, and server management

Modified Files (29)

Core Improvements:

  • README.md - Updated with ECS deployment information and roadmap
  • Makefile - Consolidated build targets and A2A agent helpers
  • build-config.yaml - Region-agnostic container image configuration
  • .gitignore - Added terraform state and local configuration exclusions

Application Updates:

  • registry/ - Agent and server route improvements, auth enhancements
  • auth_server/ - OAuth provider configuration updates
  • docker/ - Updated nginx configurations and new Dockerfile for scopes-init

Agent Dockerfiles:

  • agents/a2a/src/*/Dockerfile - Fixed build context paths

Breaking Changes

⚠️ Configuration Changes Required

  1. INITIAL_ADMIN_PASSWORD is now mandatory in init-keycloak.sh (no default fallback)
  2. Container image URIs in terraform.tfvars.example now use YOUR_AWS_REGION and YOUR_ACCOUNT_ID placeholders, which must be replaced with real values in terraform.tfvars

Migration Guide

For existing deployments:

  1. Set INITIAL_ADMIN_PASSWORD environment variable before running init-keycloak.sh
  2. Update container image URIs in terraform.tfvars with your actual AWS region and account ID
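
The first migration step amounts to a fail-fast check on the environment. init-keycloak.sh is a shell script and its actual check is not reproduced here; the Python sketch below only illustrates the pattern, and `require_env` is a hypothetical helper name.

```python
import os


def require_env(name, env=None):
    """Return the value of a required variable, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        # No default fallback: stop instead of silently using a known password.
        raise RuntimeError(f"{name} must be set before running this script")
    return value
```

Calling `require_env("INITIAL_ADMIN_PASSWORD")` at the top of a script stops it before any Keycloak call is made.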

Testing

Infrastructure Validation

  • ✅ Deployed successfully in us-east-1
  • ✅ Multi-AZ architecture verified across 3 availability zones
  • ✅ HTTPS/SSL certificates validated with ACM
  • ✅ Auto-scaling tested with CPU/memory thresholds
  • ✅ CloudWatch alarms and monitoring functional
  • ✅ Keycloak integration with Aurora PostgreSQL working
  • ✅ EFS shared storage accessible across tasks

Container Images

  • ✅ All 7 container images built and pushed to ECR
  • ✅ Validated actual sizes match documentation (~9.8GB total)
  • ✅ A2A agents (flight_booking, travel_assistant) deployed successfully

Security

  • ✅ INITIAL_ADMIN_PASSWORD requirement enforced
  • ✅ No default passwords remain in codebase
  • ✅ Network access controls functional with CIDR blocks
  • ✅ SSL/TLS encryption verified on all HTTPS endpoints

Performance

  • Build Time: ~25-30 minutes for all container images
  • Deployment Time: ~30-40 minutes for complete regional deployment
  • Total Container Size: ~9.8GB across 7 images (validated from ECR)

Deployment Time Estimates

| Stage | Time | Notes |
|---|---|---|
| Container Build & Push | ~25-30 min | All 7 images to ECR |
| SSL Certificate Creation | ~5 min | ACM validation |
| Infrastructure Deploy | ~10 min | ECS, ALB, VPC, RDS |
| DNS Propagation | ~10 min | Route53 record creation |
| Keycloak Initialization | ~5 min | Realm and user setup |
| Total (excluding container build) | ~30-40 min | Complete regional deployment |

Documentation Updates

  • terraform/aws-ecs/README.md - Comprehensive deployment guide (1,800+ lines)
  • README.md - Updated main README with ECS deployment information
  • Password security and distinction clearly documented
  • Regional vs custom domain configuration explained
  • Troubleshooting and operations guides included

Closes

Additional Notes

  • This deployment has been tested in us-east-1 with multi-AZ architecture
  • All container images have been validated against actual ECR data
  • Documentation includes production-ready security best practices
  • Regional domain support enables multi-region deployments
  • Cost estimates and optimization guidance included in documentation

Checklist

  • Code follows project style guidelines
  • Documentation updated
  • Security best practices implemented
  • Infrastructure tested in AWS
  • Container images validated
  • No hardcoded credentials or regions
  • Breaking changes documented with migration guide

Gaurav Rele and others added 30 commits November 7, 2025 17:50
Keycloak was enforcing HTTPS at the token endpoint even though KC_HTTP_ENABLED=true
and KC_PROXY=edge were configured. The issue was missing KC_PROXY_ADDRESS_FORWARDING
environment variable.

When Keycloak is behind an ALB:
- ALB forwards HTTP requests to Keycloak container internally
- ALB sets X-Forwarded-* headers to indicate the client protocol
- Without KC_PROXY_ADDRESS_FORWARDING, Keycloak ignores these headers
- Result: Keycloak only sees internal HTTP and defaults to HTTPS enforcement

With KC_PROXY_ADDRESS_FORWARDING=true:
- Keycloak trusts the proxy headers from the ALB
- Recognizes that clients are using HTTP (as per X-Forwarded-Proto)
- Allows HTTP connections to OAuth2 endpoints without redirect
- Enables the disable-ssl.sh script to obtain admin tokens

This fix allows the Keycloak SSL disabling script to work correctly.
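
The header handling described above can be illustrated with a small sketch. This is not Keycloak's implementation; `effective_scheme` and its parameters are hypothetical names, and a plain dict with exact header casing is assumed.

```python
def effective_scheme(headers, connection_scheme, trust_proxy):
    """Return the scheme the client used, as seen by a server behind a proxy."""
    if trust_proxy:
        # A chain of proxies may append values; the first one is the client's.
        forwarded = headers.get("X-Forwarded-Proto", "").split(",")[0].strip().lower()
        if forwarded in ("http", "https"):
            return forwarded
    # Without trust in the proxy, only the internal hop is visible.
    return connection_scheme
```

With `trust_proxy=False` the function only ever reports the internal ALB-to-container scheme, which mirrors the behaviour this commit describes.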
Corrected the frontendUrl parameter from a top-level property to the
attributes object, which is the proper way to set it via the REST API
according to the RealmRepresentation schema.

Changes:
- Moved frontendUrl from top-level to realm attributes
- Updated request body to: {"attributes": {"frontendUrl": "..."}}
- Added clarification that this may not be supported in all versions
- Maintains backward compatibility with the main sslRequired setting

The sslRequired parameter is still correct with value "none" according
to the Keycloak documentation.
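
A minimal sketch of the request body described above, assuming both settings are sent in one realm update. `build_realm_update` is a hypothetical helper, and the URL is the one from the script's usage example.

```python
import json


def build_realm_update(frontend_url):
    """Build a realm update body with frontendUrl nested under attributes."""
    return {
        "sslRequired": "none",
        # frontendUrl is a realm attribute, not a top-level property.
        "attributes": {"frontendUrl": frontend_url},
    }


body = json.dumps(build_realm_update("http://keycloak:8080"))
```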
This script allows disabling SSL requirements on Keycloak realms
via the REST API. It can be used if SSL disabling is needed in the future.

The script:
- Fetches Keycloak admin password from AWS Secrets Manager
- Obtains admin token via OAuth2 password grant
- Disables SSL for master and mcp-gateway realms
- Includes verbose logging for troubleshooting

Usage:
  VERBOSE=1 KEYCLOAK_URL=http://keycloak:8080 ./keycloak/setup/disable-ssl.sh

Note: This script is for optional use and is NOT enabled in the
Keycloak ECS configuration by default.
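
The token step can be sketched as follows. The script itself is shell and is not reproduced here; the sketch only builds the request without sending it. It assumes the master realm token endpoint without a legacy /auth prefix and the built-in admin-cli client, and `build_admin_token_request` is a hypothetical name.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_admin_token_request(keycloak_url, username, password):
    """Build (but do not send) an OAuth2 password-grant request for an admin token."""
    url = f"{keycloak_url.rstrip('/')}/realms/master/protocol/openid-connect/token"
    form = urlencode({
        "grant_type": "password",
        "client_id": "admin-cli",  # assumption: Keycloak's built-in admin CLI client
        "username": username,
        "password": password,
    }).encode()
    return Request(
        url,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the result to `urllib.request.urlopen` would perform the call; the JSON response is expected to carry the token in `access_token`.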
- Add .gitignore entries for Terraform user files (terraform.tfvars, .terraform, crash logs)
- Configure ALB as internet-facing with specific IP allowlists instead of 0.0.0.0/0
- Add Keycloak ALB configuration variables for separate network control
- Update ingress CIDR blocks to use specific IP addresses (laptop + EC2 instance)
- Improve terraform.tfvars.example with comprehensive documentation
- Add Keycloak OAuth2 client secret variables to root configuration
- Restrict network access to known IPs for improved security
- Separate Keycloak ALB configuration from main ALB configuration

- Deleted database.tf (PostgreSQL only for Keycloak)
- Removed all Keycloak secrets (database, admin, client)
- Removed Keycloak ALB and listeners
- Removed Keycloak ECS service (was using start-dev mode)
- Removed all Keycloak variables and outputs
- Removed Keycloak CloudWatch alarms
- Removed RDS alarms for Keycloak database
- Total: 13 files modified, 1 file deleted

Files Modified:
- terraform/aws-ecs/main.tf
- terraform/aws-ecs/modules/mcp-gateway/ecs-services.tf
- terraform/aws-ecs/modules/mcp-gateway/iam.tf
- terraform/aws-ecs/modules/mcp-gateway/locals.tf
- terraform/aws-ecs/modules/mcp-gateway/monitoring.tf
- terraform/aws-ecs/modules/mcp-gateway/networking.tf
- terraform/aws-ecs/modules/mcp-gateway/outputs.tf
- terraform/aws-ecs/modules/mcp-gateway/secrets.tf
- terraform/aws-ecs/modules/mcp-gateway/variables.tf
- terraform/aws-ecs/outputs.tf
- terraform/aws-ecs/variables.tf

Files Deleted:
- terraform/aws-ecs/modules/mcp-gateway/database.tf

Verification Tests Passed:
✓ No Keycloak references in .tf or .tfvars files
✓ Terraform validate succeeds
✓ Terraform plan shows destructions only
✓ database.tf successfully deleted

Ref: docs/keycloak-integration/keycloak-removal-checklist.md

- Uses 'start --optimized' instead of 'start-dev'
- Pre-builds Keycloak for production
- Copied from working aws-ecs-keycloak repository

Added files:
- keycloak-database.tf: Aurora MySQL Serverless v2 + RDS Proxy
- keycloak-ecs.tf: ECS service with production mode
- keycloak-security-groups.tf: Security groups integrated with VPC
- keycloak-alb.tf: Application Load Balancer with HTTPS
- keycloak-dns.tf: Route53 zone and ACM certificate
- keycloak-ecr.tf: ECR repository for Docker images
- locals.tf: Common tags for all resources

Key integrations:
- Uses EXISTING VPC (module.vpc.vpc_id)
- Production-ready configuration
- Auto-validated SSL certificate
- RDS Proxy for connection pooling
- CloudWatch logging and monitoring
- ECS auto-scaling based on CPU/memory
- Deployed to us-west-2

- Create registry-dns.tf with Route 53 DNS configuration
- Add A record for registry.mycorp.click pointing to main ALB
- Generate ACM certificate for registry.mycorp.click
- Auto-validate certificate with DNS challenge
- Add registry_url, registry_certificate_arn outputs
- Enables HTTPS support for main registry service

Phases completed:
- Phase 0: Removed all 284 Keycloak references from broken implementation
- Phase 1-2: Copied production Dockerfile (uses start --optimized)
- Phase 3: Added 6 Keycloak Terraform files (968 lines)
- Phase 4: Built and pushed Docker image to ECR
- Phase 5: Deployed Keycloak infrastructure with Terraform
- Phase 6: Configured DNS and SSL certificates

New infrastructure deployed:
- Keycloak: kc.mycorp.click (Aurora MySQL, ECS, ALB, ACM)
- Registry: registry.mycorp.click (DNS, ACM cert, linked to main ALB)
- VPC: 10.0.0.0/16 with 3 AZs
- ECS Services: Keycloak, Auth Server, Registry (all running/starting)
- Security: Proper security groups, IAM roles, secrets management

Files created:
- keycloak-database.tf: Aurora MySQL Serverless v2 + RDS Proxy
- keycloak-ecs.tf: ECS service with auto-scaling
- keycloak-security-groups.tf: Security group rules
- keycloak-alb.tf: Application Load Balancer
- keycloak-dns.tf: Route53 + ACM certificate
- keycloak-ecr.tf: ECR repository
- registry-dns.tf: Registry DNS + certificate
- docker/keycloak/Dockerfile: Production-ready image

Configuration:
- Admin: admin (password in SSM Parameter Store)
- Database: Serverless Aurora MySQL 0.5-2 ACU
- Region: us-west-2
- Auto-scaling: Enabled for all services
- Monitoring: CloudWatch alarms configured

Deployment verified:
- DNS resolution: Both domains resolve to ALB IPs
- Keycloak health: 503 (service starting up)
- ECS services: Auth Server running (2/2), Registry pending (0/2)
- Security groups and certificates properly configured

Next steps:
- Wait for all services to reach running state
- Configure Keycloak realms and clients
- Link main ALB HTTPS to registry certificate
- Run end-to-end authentication tests

- Created build-and-push-keycloak.sh: Automated ECR push script
  - Build Docker image from Dockerfile
  - Auto-login to AWS ECR
  - Tag and push to ECR with configurable tags
  - Verify push success
  - Support for custom regions/profiles
  - Color-coded output and error handling

- Updated Makefile with Keycloak targets
  - make build-keycloak: Build locally
  - make build-and-push-keycloak: Build and push to ECR
  - make deploy-keycloak: Deploy to ECS
  - make update-keycloak: Full workflow (build+push+deploy)
  - Support for AWS_REGION, AWS_PROFILE, IMAGE_TAG variables

- Added scripts/README.md documentation
  - Complete usage examples
  - Troubleshooting guide
  - Option reference
  - Prerequisites and features

Replaces manual build/push steps with automated, repeatable process.
Simplifies future Keycloak image updates and deployments.

- Created save-terraform-outputs.sh: Automated outputs export script
  - Exports all terraform outputs to text or JSON format
  - Creates formatted, readable output file
  - Automatic backup of previous outputs
  - Shows key infrastructure URLs and details
  - Color-coded logging and progress tracking

- Updated Makefile with output export targets
  - make save-outputs: Export as formatted text
  - make save-outputs-json: Export as JSON
  - Added to help documentation

- Generated initial terraform-outputs.txt
  - Documents all deployed resources
  - Contains all service URLs
  - Includes deployment summary and metadata
  - Ready for archival and documentation

This provides clear documentation of all deployed resources
and makes it easy to regenerate outputs as infrastructure changes.

- Create view-cloudwatch-logs.sh: New script to view CloudWatch logs for all ECS services (keycloak, registry, auth-server) with support for live tailing, time range filtering, and pattern matching
- Simplify save-terraform-outputs.sh: Now outputs JSON-only format for better machine readability
- Move build-and-push-keycloak.sh to terraform/aws-ecs/scripts/ directory for better organization
- Move save-terraform-outputs.sh to terraform/aws-ecs/scripts/ directory
- Create comprehensive README.md for scripts directory documenting all utilities
- Update Makefile to add view-logs targets and update script paths

New make targets:
- make view-logs: View all component logs from last 30 minutes
- make view-logs-keycloak: View Keycloak logs only
- make view-logs-registry: View Registry logs only
- make view-logs-auth: View Auth Server logs only
- make view-logs-follow: Follow all logs in real-time

Features:
- CloudWatch logs script supports --minutes, --follow, --component, --filter options
- All scripts have color-coded output for easy readability
- Automated backup of previous terraform outputs
- AWS CLI integration for fetching logs
…roxy

- Changed database URL from RDS Proxy endpoint to direct RDS cluster endpoint
- Added depends_on relationship between proxy target and RDS instance
- This resolves the 'Communications link failure' error Keycloak was experiencing
- Direct connection is more reliable for Serverless v2 Aurora

- Changed output location from terraform/ to terraform/aws-ecs/scripts/
- Updated TERRAFORM_DIR to include full path from repo root
- Added OUTPUT_DIR variable pointing to script directory
- Updated documentation comments to reflect new location
- Tested and verified script works correctly
- Output file now in same directory as the script for easier access

- Added load_from_terraform_outputs() function to read from terraform-outputs.json
- Script now automatically loads ALB DNS names from saved terraform outputs
- Greatly simplifies usage - only requires 3 env vars instead of 5
- Falls back gracefully if terraform-outputs.json not found or jq not available
- Updated INIT-KEYCLOAK.md with simplified usage examples
- Prioritizes explicitly set environment variables over JSON values

- Script now retrieves KEYCLOAK_ADMIN_PASSWORD from SSM if not set via env var
- Uses AWS CLI to fetch /keycloak/admin_password parameter
- Falls back to environment variable if SSM unavailable
- Eliminates need to manually pass admin password when AWS credentials available
- Shows helpful error message if neither source provides password
- Greatly simplifies script usage in automated deployments
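
The lookup order described in these two commits (explicit environment variable first, then saved Terraform outputs) can be sketched in Python. The real logic lives in shell with jq and is not reproduced here. The sketch assumes terraform-outputs.json holds raw `terraform output -json` data, omits the SSM lookup, and uses hypothetical function and key names.

```python
import json
import os


def load_terraform_outputs(path):
    """Read saved outputs; return {} if the file is missing or not valid JSON."""
    try:
        with open(path) as fh:
            raw = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return {}
    # `terraform output -json` wraps each output as {"value": ..., "type": ...}.
    return {k: v["value"] for k, v in raw.items() if isinstance(v, dict) and "value" in v}


def resolve_setting(env_name, output_key, outputs, env=None, default=None):
    """Prefer an explicitly set environment variable over the saved output."""
    env = os.environ if env is None else env
    if env.get(env_name):
        return env[env_name]
    return outputs.get(output_key, default)
```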
Major improvements:
- Fix all jq parsing errors in init-keycloak.sh with proper type checking
- Add token expiration handling to prevent 401 authentication errors
- Create smart JWT token management with SSM caching (get-m2m-token.sh)
- Add user and service management scripts for cloud deployment
- Update mcp_client.py to support OAUTH_TOKEN environment variable
- Add comprehensive post-deployment documentation to README
- Update .gitignore for Terraform plan and backup files
- Fix Keycloak database configuration and networking setup

Scripts:
- terraform/aws-ecs/scripts/get-m2m-token.sh: Smart token retrieval with SSM cache
- terraform/aws-ecs/scripts/user_mgmt.sh: M2M and human user management
- terraform/aws-ecs/scripts/service_mgmt.sh: MCP server registration
- terraform/aws-ecs/scripts/init-keycloak.sh: Enhanced with robust error handling

All scripts now support environment variable overrides with automatic fallback
to terraform-outputs.json and AWS SSM Parameter Store.
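
Token expiration handling of the kind listed above usually comes down to reading the `exp` claim before reusing a cached token. The sketch below is an illustration only, not the logic in get-m2m-token.sh; the function names and the 60-second safety margin are assumptions.

```python
import base64
import json
import time


def jwt_expiry(token):
    """Return the `exp` claim (seconds since epoch) without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))["exp"]


def needs_refresh(token, now=None, skew_seconds=60):
    """True if the token has expired or will expire within the safety margin."""
    now = time.time() if now is None else now
    return jwt_expiry(token) <= now + skew_seconds
```

This only decides whether a cached token is still usable; it does not validate the token.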
Major changes:
- Add Python registry management client with Pydantic models
- Fix OAuth login by adding AUTH_SERVER_URL environment variable
- Add EFS volume mount for auth server scopes.yml configuration
- Fix ALB security group to allow public access (0.0.0.0/0)
- Fix SSM put-parameter output contamination in get-m2m-token.sh
- Add service account name fallback (service-account- prefix)
- Add registry URL configuration priority (env > terraform-outputs.json > default)

Python Client:
- registry_client.py: Core API client with Pydantic models
- registry_management.py: CLI wrapper for registry operations
- Token retrieval via get-m2m-token.sh subprocess
- Token redaction in logs (show only first 8 characters)
- Debug logging support with --debug flag

Infrastructure:
- Add auth_config EFS access point for runtime configuration
- Add copy-scopes-to-efs.sh script for initial setup
- Update scopes.yml loading to check SCOPES_CONFIG_PATH env var
- Change default ingress_cidr_blocks from specific IPs to 0.0.0.0/0

Token Management:
- Fix SSM parameter double JSON parsing
- Redirect aws ssm put-parameter output to prevent contamination
- Support both original and service-account- prefixed client names
- Check both SSM parameter locations for tokens
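
The token redaction listed under Python Client can be sketched as below. `redact_token` is a hypothetical name, and fully masking tokens shorter than the visible prefix is an assumption, not something the commit states.

```python
def redact_token(token, visible=8):
    """Return a log-safe form of a token that shows only its first characters."""
    if len(token) <= visible:
        # Too short to show a prefix without revealing the whole value.
        return "*" * len(token)
    return token[:visible] + "..."
```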
Added Pydantic models and methods for the A2A (Agent-to-Agent) agent
management API based on docs/api-specs/a2a-agent-management.yaml.

New Pydantic Models:
- AgentProvider, AgentVisibility, SecuritySchemeType enums
- AgentRegistration: Agent registration request
- AgentCard: Agent summary view
- AgentDetail: Detailed agent information
- Skill, SkillDetail: Agent capability models
- AgentListResponse, AgentToggleResponse: Operation responses
- SkillDiscoveryRequest, AgentDiscoveryResponse: Skill-based discovery
- SemanticDiscoveredAgent, AgentSemanticDiscoveryResponse: Semantic search

New RegistryClient Methods:
- register_agent(): Register new A2A agent
- list_agents(): List agents with filtering (query, enabled_only, visibility)
- get_agent(): Get detailed agent information
- update_agent(): Update existing agent
- delete_agent(): Remove agent from registry
- toggle_agent(): Enable/disable agent
- discover_agents_by_skills(): Find agents by required skills
- discover_agents_semantic(): NLP semantic search using FAISS

All methods include proper type hints, docstrings, and error handling
for HTTP status codes (404, 403, 409, 422, 400, 500).

Extended the CLI to support all A2A (Agent-to-Agent) agent management
operations from the registry client.

New CLI Commands:
- agent-register: Register new agent from JSON config
- agent-list: List agents with filtering (query, enabled_only, visibility)
- agent-get: Get detailed agent information
- agent-update: Update existing agent from JSON config
- agent-delete: Delete agent with confirmation prompt
- agent-toggle: Enable/disable agent
- agent-discover: Discover agents by required skills (comma-separated)
- agent-search: Semantic search using natural language queries

Command Features:
- JSON config file support for registration/updates
- Automatic enum conversion (provider, visibility)
- Skill object construction from JSON
- Pretty-printed JSON output for structured data
- Confirmation prompts for destructive operations
- Support for filtering and search parameters
- Max results limits for discovery operations

Example Usage:
  # Register agent
  uv run python registry_management.py agent-register --config agent.json

  # List enabled agents
  uv run python registry_management.py agent-list --enabled-only

  # Discover by skills
  uv run python registry_management.py agent-discover \
    --skills code_analysis,bug_detection --max-results 5

  # Semantic search
  uv run python registry_management.py agent-search \
    --query "agents that analyze code"

Updated module docstring with comprehensive examples for all agent
management operations.

Enhanced cmd_agent_register and cmd_agent_update to handle real-world
JSON configurations more gracefully.

Improvements:
- Support both 'input_schema' and 'parameters' field names for skills
- Support both 'name' and 'id' field names for skill names
- Map provider values flexibly ('Example Corp' -> 'custom')
- Map security scheme types (OpenAPI 'http' -> 'bearer')
- Filter out extra fields not in AgentRegistration model
- Better error logging with exc_info for debugging
- Graceful handling of unknown enum values with warnings

Provider Mapping:
- Accepts 'anthropic', 'custom', 'other' (exact matches)
- Maps 'Example Corp', 'example' to 'custom'
- Unknown values default to 'custom' with warning

Security Scheme Mapping:
- Maps OpenAPI 'http' type to 'bearer'
- Supports 'bearer', 'apikey'/'api_key', 'oauth2'
- Unknown types default to 'bearer'

This allows the CLI to work with various JSON formats including
OpenAPI-style agent specifications without manual conversion.
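
The two mappings above can be sketched as lookup tables. This is an illustration, not the code in registry_management.py: the function names are hypothetical, matching is made case-insensitive as an assumption, and 'apikey' is assumed to be the canonical spelling for both API key variants.

```python
import logging

logger = logging.getLogger(__name__)

_PROVIDERS = {"anthropic", "custom", "other"}
_PROVIDER_ALIASES = {"example corp": "custom", "example": "custom"}
_SCHEME_TYPES = {
    "http": "bearer",  # OpenAPI 'http' type maps to 'bearer'
    "bearer": "bearer",
    "apikey": "apikey",
    "api_key": "apikey",
    "oauth2": "oauth2",
}


def map_provider(value):
    """Map a free-form provider string to a known provider value."""
    key = value.strip().lower()
    if key in _PROVIDERS:
        return key
    if key in _PROVIDER_ALIASES:
        return _PROVIDER_ALIASES[key]
    logger.warning("Unknown provider %r, defaulting to 'custom'", value)
    return "custom"


def map_security_scheme_type(value):
    """Map a security scheme type to a known type, defaulting to 'bearer'."""
    return _SCHEME_TYPES.get(value.strip().lower(), "bearer")
```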
This commit addresses multiple issues with agent registration, Keycloak configuration, and OAuth2 authentication:

1. Synchronized Pydantic models between client and server
   - Added missing fields to SecurityScheme model (scheme, bearer_format, in_, name, flows, openid_connect_url)
   - Updated Skill model: made id required, renamed input_schema to parameters, added tags
   - Fixed AgentListItem/AgentListResponse: moved total_count to response level, added missing fields
   - Fixed security_schemes transformation to preserve all fields from original JSON

2. Enhanced init-keycloak.sh with LOB groups and service accounts
   - Added registry-admins, registry-users-lob1, registry-users-lob2 groups
   - Created service account clients: registry-admin-bot, lob1-bot, lob2-bot
   - Created LOB users: lob1-user, lob2-user
   - Added proper group assignments for all entities

3. Fixed OAuth2 client secret persistence
   - Stored mcp-gateway-web client secret in AWS Secrets Manager
   - Added SSM Parameter Store backup for secret persistence
   - Ensures auth server survives terraform redeployments

These changes fix validation errors during agent registration and prevent OAuth2 authentication failures after ECS task redeployments.
Ubuntu and others added 27 commits November 21, 2025 05:23
Some MCP servers (currenttime, realserverfaketools) don't allow multiple concurrent
sessions on the same streamable-http endpoint. This causes tool fetches to fail with
400 Bad Request when attempted immediately after health checks.

Added 0.5 second delay before tool fetch to ensure the health check session
is properly closed before attempting to establish a new session for tool retrieval.

This fixes the 0 tools issue for currenttime and realserverfaketools servers.
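
A sketch of the ordering this fix introduces, assuming async callables for the health check and the tool fetch. `check_then_fetch` and its parameters are hypothetical names, not the registry's actual functions.

```python
import asyncio


async def check_then_fetch(health_check, fetch_tools, delay=0.5):
    """Run the health check, pause, then fetch tools in a new session."""
    healthy = await health_check()
    if not healthy:
        return []
    # Give the server time to close the health-check session first.
    await asyncio.sleep(delay)
    return await fetch_tools()
```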
The remove endpoint expects form field named 'path' but the client was sending
'service_path'. This caused FastAPI validation to fail with 422 Unprocessable Entity
because the required 'path' parameter was missing.

Changed data dict key from 'service_path' to 'path' to match the endpoint's Form()
parameter declaration.
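
The fix is a one-key change in the form body. The sketch below only builds that body with the standard library; `build_remove_form` is a hypothetical name and the endpoint URL is deliberately left out.

```python
from urllib.parse import urlencode


def build_remove_form(service_path):
    """Encode the form body for the remove endpoint."""
    # The endpoint declares its Form() parameter as `path`, so that must be the key.
    return urlencode({"path": service_path})
```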
Enhanced Makefile with new targets for local A2A agent development:
- compose-up-agents: Start agents with docker-compose locally
- compose-down-agents: Stop local agents
- compose-logs-agents: Follow agent logs in real-time
- build-agents: Build both agent images locally
- push-agents: Push both agent images to ECR

These targets simplify the workflow for developing, testing, and deploying
the Flight Booking Agent and Travel Assistant Agent A2A services.

The agent Dockerfiles copied '.', which pulled in the entire agents/a2a
directory tree when the build context is agents/a2a. As a result, agent.py
was not found at the expected location.

Updated COPY commands to explicitly reference the correct source paths:
- flight-booking-agent: COPY src/flight-booking-agent/ ./
- travel-assistant-agent: COPY src/travel-assistant-agent/ ./

This ensures agent.py and all other agent code is correctly copied into the
/app directory inside the container, fixing the "Failed to spawn: agent.py"
errors in the logs.

Root cause: The Dockerfiles are located in src/{agent}/ but the build context
in build-config.yaml is agents/a2a, so relative paths needed to account for this.

The build-images.sh script was using context 'agents/a2a' which caused
agent.py to be placed at /app/src/flight-booking-agent/agent.py instead
of /app/agent.py, leading to "Failed to spawn: agent.py" errors in ECS.

Changes:
- Update build-config.yaml to use agent-specific contexts:
  * flight_booking_agent: agents/a2a/src/flight-booking-agent
  * travel_assistant_agent: agents/a2a/src/travel-assistant-agent
- Update build-images.sh setup_a2a_agent() to copy dependencies from
  agents/a2a level to each agent's .tmp/ directory
- Simplify Dockerfile comments to remove confusing dual-context notes

This aligns build-images.sh behavior with docker-compose.local.yml,
which also uses agent-specific contexts. Both systems now correctly
place agent.py at /app/agent.py within the container.

Verified all required files exist and build system is consistent.
…ectory

- Created /api directory with standalone registry management scripts
- Added Anthropic Registry API v0.1 client methods:
  - anthropic_list_servers() - List all servers
  - anthropic_list_server_versions() - List versions for a server
  - anthropic_get_server_version() - Get server details
- Integrated Anthropic API commands into registry_management.py:
  - anthropic-list: List all servers with optional --raw JSON output
  - anthropic-versions: List versions for a specific server
  - anthropic-get: Get detailed server information
- Made REGISTRY_URL environment variable mandatory in /api scripts
- Removed terraform-outputs.json dependency from /api scripts
- Added registry URL cascading lookup to anthropic_registry_example.py
- Updated both terraform/aws-ecs/scripts versions to support Anthropic API

Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

- Updated default efs_throughput_mode from 'provisioned' to 'bursting'
- Added comment explaining bursting mode is FREE and recommended
- Bursting mode provides up to 100 MiB/s which is sufficient for registry operations
- Provisioned mode costs $6/MiB/s-month and should only be used when proven necessary

This change will significantly reduce AWS costs for new deployments.

Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

- Unified container builds: Update Makefile to use centralized build-config.yaml
- Remove redundant build-and-push scripts (4 scripts superseded by unified build system)
- Relocate client tools: Move registry_client.py and registry_management.py to /api
- Improve backup management: Store terraform-outputs.json backups in .terraform/
- Rename setup-keycloak-client.sh to rotate-keycloak-web-client-secret.sh for clarity
- Remove obsolete deploy-currenttime-ecs.sh (terraform handles deployments)
- Remove keycloak-integration docs folder (outdated planning documents)

Changes:
- Makefile: Keycloak build targets now use unified build system
- 8 scripts removed from terraform/aws-ecs/scripts/
- save-terraform-outputs.sh: Backups now stored in gitignored .terraform/
- rotate-keycloak-web-client-secret.sh: Improved documentation and naming
…in support

- Fixed all 5 group management API endpoints (add/remove-from-groups, create/delete-group, list-groups)
  - Added missing Request parameter to all endpoints
  - Fixed parameter name mismatches (server_path→server_name, groups→group_names)
  - Added missing optional parameters (description, create_in_keycloak, delete_from_keycloak, force)

- Removed hardcoded us-west-2 region references (12 instances)
  - init-keycloak.sh: 5 fixes
  - rotate-keycloak-web-client-secret.sh: 1 fix
  - user_mgmt.sh: 1 fix
  - main.tf: Changed registry_image_uri to use variable
  - Deleted deprecated copy-scopes-to-efs.sh

- Implemented regional domain support
  - Added use_regional_domains flag with base_domain variable
  - Domains now auto-generate as kc.{region}.mycorp.click and registry.{region}.mycorp.click
  - Updated all terraform files to use local.keycloak_domain and local.root_domain
  - Supports both regional and static domain modes

- Fixed metrics-service Dockerfile COPY paths to match build context

- Updated terraform.tfvars.example with all variables and comprehensive documentation
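
The regional domain naming can be sketched as a pure function. The real logic lives in Terraform locals and is not reproduced here. `service_domains` is a hypothetical name, and the static-mode names (kc.<base> and registry.<base>) are an assumption based on the domains used in earlier commits.

```python
def service_domains(base_domain, region, use_regional_domains):
    """Return the Keycloak and registry hostnames for a deployment."""
    # Regional mode inserts the region between the service prefix and the base domain.
    prefix = f"{region}." if use_regional_domains else ""
    return {
        "keycloak": f"kc.{prefix}{base_domain}",
        "registry": f"registry.{prefix}{base_domain}",
    }
```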
- Production-grade documentation (1,703 lines) for AWS ECS infrastructure
- Complete architecture explanation with diagram
- Step-by-step regional deployment guide
- Prerequisites with tool versions and IAM policies
- Post-deployment checklist (7 steps)
- Container build and deployment workflows
- Complete developer workflow from code to deployment
- Troubleshooting guide with real error scenarios
- Cost optimization strategies ($110-250/month breakdown)
- Security best practices and backup procedures
- Quick reference cheat sheet for common commands
- All links verified, no hardcoded credentials
- Ready for GitHub with professional formatting
…ariable

Problem:
The build-images.sh and generate-image-manifest.sh scripts were parsing
the ECR registry URL directly from build-config.yaml which had a hardcoded
us-west-2 region. This caused ECR authentication failures when deploying
to other regions even when AWS_REGION environment variable was set.

Error was:
  Error response from daemon: login attempt to
  https://605134468121.dkr.ecr.us-west-2.amazonaws.com/v2/ failed
  with status: 400 Bad Request

Root cause:
Line 50 in build-images.sh:
  ECR_REGISTRY=$(grep 'ecr_registry:' "$CONFIG_FILE" ...)
This parsed the hardcoded us-west-2 URL from build-config.yaml regardless
of the AWS_REGION environment variable.

Solution:
1. Updated scripts/build-images.sh:
   - Construct ECR_REGISTRY dynamically using AWS_REGION env var
   - Changed from parsing config to:
     ECR_REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
   - Defaults to us-west-2 if AWS_REGION not set

2. Updated scripts/generate-image-manifest.sh:
   - Same dynamic ECR registry construction
   - Now respects AWS_REGION environment variable
   - Outputs region being used for transparency

3. Updated build-config.yaml documentation:
   - Added comments explaining that region values are defaults
   - Documented that AWS_REGION env var overrides config
   - Included usage example for multi-region deployment

Usage:
  # Deploy to us-east-1
  export AWS_REGION=us-east-1
  make build-push

  # Deploy to eu-west-1
  export AWS_REGION=eu-west-1
  make build-push IMAGE=registry

  # Use default (us-west-2)
  make build-push

This allows the same codebase to deploy to any AWS region without
manually editing configuration files.
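
A minimal sketch of the dynamic registry construction described above, assuming a POSIX shell. The function name `resolve_ecr_registry` and the 12-digit validation are illustrative assumptions, not the actual contents of build-images.sh. In a real run the account ID would come from `aws sts get-caller-identity --query Account --output text`.

```shell
#!/bin/sh
# Sketch only: resolve_ecr_registry is a hypothetical helper name, not a
# function taken from build-images.sh.

# Usage: resolve_ecr_registry ACCOUNT_ID
# Prints ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com, where REGION comes from
# AWS_REGION and falls back to us-west-2 when AWS_REGION is unset or empty.
resolve_ecr_registry() (
    account_id="$1"
    region="${AWS_REGION:-us-west-2}"

    # AWS account IDs are exactly 12 digits; reject anything else early.
    case "$account_id" in
        ''|*[!0-9]*)
            echo "error: invalid AWS account ID: '$account_id'" >&2
            exit 1
            ;;
    esac
    if [ "${#account_id}" -ne 12 ]; then
        echo "error: AWS account ID must be 12 digits: '$account_id'" >&2
        exit 1
    fi

    echo "${account_id}.dkr.ecr.${region}.amazonaws.com"
)
```

Because the function body runs in a subshell, a failed validation returns a non-zero status to the caller without terminating the calling script.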
…fig.yaml

Changed account ID from real value (605134468121) to placeholder (123456789012)
to ensure the build system fails fast if AWS credentials are not properly
configured, rather than accidentally working only for the original account.

This prevents the anti-pattern where:
- Works in original account without proper env var setup
- Silently fails or behaves unexpectedly in different accounts
- Masks configuration issues during multi-account deployments

The build scripts dynamically retrieve the actual account ID from:
  aws sts get-caller-identity --query Account --output text

Making this change ensures consistent behavior across all AWS accounts
and forces explicit credential configuration.

Updated terraform/aws-ecs/README.md to explicitly document that
use_regional_domains = true is the default configuration setting
in variables.tf.

Changes:
- Section title: 'Regional Domains (Recommended - DEFAULT)'
- Added '(the default)' in explanation text
- Added inline comments in terraform.tfvars examples stating it's the default
- Changed 'RECOMMENDED' to 'DEFAULT' in critical parameters section
- Clarified that static domains are an 'override' of the default

This makes it clear to users that they get regional domains automatically
(e.g., kc.us-east-1.mycorp.click) unless they explicitly opt out by
setting use_regional_domains = false.

Benefits:
- Users understand they don't need to set use_regional_domains = true
- Makes multi-region deployment pattern clear as default behavior
- Reduces confusion about which domain mode is being used
…cit instructions

Added comprehensive 'MANDATORY: Edit Required Parameters' section to clearly
guide users through required terraform.tfvars configuration before deployment.

Key improvements:

1. **Required Parameters Section** - Numbered list of 5 mandatory changes:
   - AWS Region configuration
   - Domain configuration (with first-time user guidance)
   - Container image URIs (all 7 images)
   - Network access control (ingress_cidr_blocks)
   - Keycloak credentials

2. **Domain Configuration for First-Time Users**:
   - Explicitly states use_regional_domains=true is already the default
   - No need to set it explicitly
   - Only need to change base_domain to their Route53 domain
   - Clear examples: kc.us-east-1.mycorp.click, registry.us-east-1.mycorp.click
   - Requirement: Must have domain registered with Route53

3. **Network Access Control - MANDATORY**:
   - Emphasized that ingress_cidr_blocks MUST be updated
   - Explained why: ALB security groups need it for access control
   - Without updating it, services won't be accessible
   - Added helper: curl -s ifconfig.me to find IP
   - Explained /32 for single IP, /24 for ranges

4. **Quick Configuration Helper Script**:
   - Automatically retrieves AWS_ACCOUNT_ID, MY_IP, AWS_REGION
   - Displays formatted output for easy copy-paste

5. **Clear Structure**:
   - Each parameter has explanation of what and why
   - Code examples with inline comments
   - Consistent formatting throughout

This addresses common deployment confusion where users:
- Don't realize they need to edit terraform.tfvars
- Don't understand use_regional_domains is already true
- Forget to update ingress_cidr_blocks and can't access services
- Miss updating all 7 container image URIs
- Use weak default passwords in production

The mandatory section appears before the detailed deployment steps,
ensuring users configure correctly before running terraform apply.
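
The quick configuration helper mentioned in item 4 could look roughly like the sketch below. The function name `format_tfvars_hints` and the `aws_region` variable name are assumptions made for illustration. Only `ingress_cidr_blocks` is named in the text above. The live lookups are left as comments because they need AWS credentials and network access.

```shell
#!/bin/sh
# Sketch only: format_tfvars_hints is a hypothetical helper, and aws_region
# is an assumed terraform.tfvars variable name.

# Usage: format_tfvars_hints ACCOUNT_ID MY_IP REGION
# Prints values in a form that is easy to copy into terraform.tfvars.
format_tfvars_hints() (
    account_id="$1"
    my_ip="$2"
    region="$3"

    printf 'AWS account ID:  %s\n' "$account_id"
    printf 'aws_region          = "%s"\n' "$region"
    # /32 restricts access to a single IP address.
    printf 'ingress_cidr_blocks = ["%s/32"]\n' "$my_ip"
)

# Live usage (needs AWS credentials and network access):
#   format_tfvars_hints \
#       "$(aws sts get-caller-identity --query Account --output text)" \
#       "$(curl -s ifconfig.me)" \
#       "${AWS_REGION:-us-west-2}"
```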
…guration

Updated Quick Start section step 3 to explicitly warn users that they must
edit required parameters in terraform.tfvars or the installation will fail.
Added clear reference to the "MANDATORY: Edit Required Parameters" section.

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Added ingress_cidr_blocks variable declaration to root variables.tf and
passed it to the mcp_gateway module in main.tf. This fixes the Terraform
warning about undeclared variable.

Variable allows users to specify CIDR blocks for ALB security group access
control via terraform.tfvars.

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…r regional domains

Fixed three critical issues preventing fresh deployment with regional domains:

1. Route53 Hosted Zone Lookup:
   - Added local.hosted_zone_domain to use base_domain (mycorp.click) for
     hosted zone lookups when use_regional_domains=true
   - Previously used local.root_domain (us-east-1.mycorp.click) which
     doesn't match the actual hosted zone name

2. Keycloak Client Secrets:
   - Changed from data sources to managed resources
   - Created with placeholder values during terraform apply
   - Added lifecycle.ignore_changes to allow init-keycloak.sh to update them
   - Prevents "couldn't find resource" errors on fresh deployments

3. Updated all ARN references:
   - Changed data.aws_secretsmanager_secret to aws_secretsmanager_secret
   - Updated in iam.tf and ecs-services.tf

These fixes enable successful terraform apply in new regions using regional
domain configuration.
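
To make the hosted zone fix concrete, the sketch below mirrors the regional naming in shell rather than Terraform. The helper name `regional_domains` and the `registry_domain` key are illustrative assumptions. The other keys follow the locals named in this commit. The point it shows is that the hosted zone lookup uses the base domain, while the service hostnames include the region.

```shell
#!/bin/sh
# Sketch only: regional_domains is a hypothetical helper that mirrors the
# naming described above. It is not the Terraform code itself.

# Usage: regional_domains BASE_DOMAIN REGION
regional_domains() (
    base_domain="$1"
    region="$2"
    root_domain="${region}.${base_domain}"

    # The Route53 hosted zone is registered under the base domain, so the
    # lookup must not include the region prefix.
    echo "hosted_zone_domain=${base_domain}"
    echo "root_domain=${root_domain}"
    echo "keycloak_domain=kc.${root_domain}"
    # registry_domain is an illustrative key, not a local named in the source.
    echo "registry_domain=registry.${root_domain}"
)
```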

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Document the required two-stage Terraform deployment process for
first-time deployments due to SSL certificate ARN dependencies.

Changes:
- Add detailed explanation of why two stages are needed
- Document Stage 1: Create and validate SSL certificates with -target flags
- Document Stage 2: Deploy remaining infrastructure
- Clarify that subsequent deployments do NOT require two stages
- Add timing expectations (5-10 minutes for certificate validation)

This resolves the ALB listener for_each dependency issue where certificate
ARNs are not known until after ACM creates and validates certificates.
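
A rough sketch of the two-stage flow, assuming a POSIX shell. The resource addresses in `CERT_TARGETS` are hypothetical placeholders. The real addresses depend on how the certificates are declared in the Terraform configuration. The script defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only. DRY_RUN defaults to 1, so commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical resource addresses; replace with the addresses from your
# own configuration.
CERT_TARGETS="aws_acm_certificate.keycloak aws_acm_certificate_validation.keycloak"

# Stage 1: create and validate only the SSL certificates.
deploy_stage1() {
    set -- terraform apply
    for target in $CERT_TARGETS; do
        set -- "$@" "-target=${target}"
    done
    run "$@"
}

# Stage 2: deploy the remaining infrastructure once the ARNs are known.
deploy_stage2() {
    run terraform apply
}
```

Running `deploy_stage1` followed by `deploy_stage2` with `DRY_RUN=0` would execute the commands. Subsequent deployments would need only the second call.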
- Add flight-booking and travel-assistant A2A agent ECS services
- Fix registry_client.py to use JSON instead of form-encoded data
- Update get-m2m-token.sh to support regional Keycloak URLs
- Add CLI example files for agents and MCP servers
- Remove hardcoded Keycloak instance from terraform config
- Update build-config.yaml with agent container definitions
- Add Keycloak first login screenshot to documentation

Technical changes:
- Agents run on port 9000 with /ping health checks
- Registry client now properly serializes request data as JSON
- Keycloak configuration supports multi-region deployment

- Add time estimates to deployment stages (~10 min for Stage 2, ~30-40 min total)
- Update container image sizes to match actual ECR data (~9.8GB total across 7 images)
- Remove non-existent container images from documentation
- Make INITIAL_ADMIN_PASSWORD mandatory in init-keycloak.sh (no defaults)
- Add clear documentation of password distinction (realm admin vs master admin)
- Standardize on mycorp.click as example domain with clear replacement note
- Update all ECR image URIs to use YOUR_AWS_REGION placeholder
- Add comprehensive network access control options documentation

Security improvements:
- Require explicit INITIAL_ADMIN_PASSWORD environment variable
- Remove insecure default passwords
- Add validation checks with helpful error messages
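
The mandatory-password check might be sketched as follows. The function name `check_admin_password` and the message wording are assumptions, not the actual init-keycloak.sh code. The only behavior taken from the text above is that the variable must be set explicitly and has no default.

```shell
#!/bin/sh
# Sketch only: check_admin_password is a hypothetical helper name.

# Returns 0 when INITIAL_ADMIN_PASSWORD is set and non-empty, otherwise
# prints a hint to stderr and returns 1. No default value is substituted.
check_admin_password() {
    if [ -z "${INITIAL_ADMIN_PASSWORD:-}" ]; then
        echo "error: INITIAL_ADMIN_PASSWORD is not set." >&2
        echo "Set it before running, e.g.:" >&2
        echo "  export INITIAL_ADMIN_PASSWORD='<strong password>'" >&2
        return 1
    fi
    return 0
}
```

A script would typically call `check_admin_password || exit 1` before doing any work, so a missing password stops the run before Keycloak is touched.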
The Docker Compose configuration uses the root directory as the build
context, so all COPY commands in the Dockerfile need to be relative to
the root, not the metrics-service directory.

Updated COPY commands to prefix paths with metrics-service/:
- pyproject.toml -> metrics-service/pyproject.toml
- app/ -> metrics-service/app/
- create_api_key.py -> metrics-service/create_api_key.py

This resolves the "not found" errors that caused docker build to fail.

- Removed terraform/aws-ecs/scripts/terraform-outputs.json (contains deployment-specific outputs)
- Removed terraform/aws-ecs/terraform.tfvars.ue1 (contains region-specific configuration)
- Added patterns to .gitignore to prevent future commits of:
  - terraform-outputs.json files (environment-specific)
  - terraform.tfvars.* files (region-specific configs like .ue1, .uw2, etc)

These files contain environment and deployment-specific data that should not be version controlled.

Updated AsorFederationConfig and AnthropicFederationConfig to have
sensible defaults that allow the application to start even when the
federation.json config file is missing.

Changes:
- Set AsorFederationConfig.enabled default to False (was True)
- Set AsorFederationConfig.endpoint default to "" (was required field)
- Set AnthropicFederationConfig.enabled default to False (was True)

This fixes the startup failure:
  "ValidationError: 1 validation error for AsorFederationConfig
   endpoint: Field required"

Federation features are now opt-in rather than opt-out, which makes
more sense for deployments that don't need federation.

@aarora79 aarora79 merged commit 4812e21 into main Nov 24, 2025
7 of 15 checks passed
Development

Successfully merging this pull request may close these issues.

Deploy MCP Gateway Registry on AWS ECS Fargate