feat: Production AWS ECS deployment with improved documentation and security #244
Merged
Keycloak was enforcing HTTPS at the token endpoint even though KC_HTTP_ENABLED=true and KC_PROXY=edge were configured. The cause was a missing KC_PROXY_ADDRESS_FORWARDING environment variable.
When Keycloak is behind an ALB:
- ALB forwards HTTP requests to the Keycloak container internally
- ALB sets X-Forwarded-* headers to indicate the client protocol
- Without KC_PROXY_ADDRESS_FORWARDING, Keycloak ignores these headers
- Result: Keycloak only sees internal HTTP and defaults to HTTPS enforcement
With KC_PROXY_ADDRESS_FORWARDING=true:
- Keycloak trusts the proxy headers from the ALB
- Recognizes that clients are using HTTP (as per X-Forwarded-Proto)
- Allows HTTP connections to OAuth2 endpoints without redirect
- Enables the disable-ssl.sh script to obtain admin tokens
This fix allows the Keycloak SSL disabling script to work correctly.
Corrected the frontendUrl parameter from a top-level property to the
attributes object, which is the proper way to set it via the REST API
according to the RealmRepresentation schema.
Changes:
- Moved frontendUrl from top-level to realm attributes
- Updated request body to: {"attributes": {"frontendUrl": "..."}}
- Added clarification that this may not be supported in all versions
- Maintains backward compatibility with the main sslRequired setting
The sslRequired parameter is still correct with value "none" according
to the Keycloak documentation.
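As an illustration of the corrected request body, here is a minimal Python sketch that builds the payload described above. The helper name `build_realm_update` is hypothetical, and the assumption that the body is sent with PUT to `/admin/realms/{realm}` should be checked against the Keycloak version in use.

```python
from typing import Optional


def build_realm_update(frontend_url: Optional[str] = None) -> dict:
    """Build a partial realm update body (hypothetical helper).

    sslRequired stays a top-level property, while frontendUrl goes
    inside the attributes object, as the commit message describes.
    """
    body: dict = {"sslRequired": "none"}
    if frontend_url:
        # Assumption: the attribute is only sent when a URL is supplied,
        # since it may not be supported in all Keycloak versions.
        body["attributes"] = {"frontendUrl": frontend_url}
    return body
```

Calling `build_realm_update("https://kc.mycorp.click")` yields a dictionary that can be serialized with `json.dumps` and used as the request body.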
…ding" This reverts commit add75db.
This reverts commit ac458fd.
This script allows disabling SSL requirements on Keycloak realms via the REST API. It can be used if SSL disabling is needed in the future.
The script:
- Fetches Keycloak admin password from AWS Secrets Manager
- Obtains admin token via OAuth2 password grant
- Disables SSL for master and mcp-gateway realms
- Includes verbose logging for troubleshooting
Usage:
VERBOSE=1 KEYCLOAK_URL=http://keycloak:8080 ./keycloak/setup/disable-ssl.sh
Note: This script is for optional use and is NOT enabled in the Keycloak ECS configuration by default.
- Add .gitignore entries for Terraform user files (terraform.tfvars, .terraform, crash logs)
- Configure ALB as internet-facing with specific IP allowlists instead of 0.0.0.0/0
- Add Keycloak ALB configuration variables for separate network control
- Update ingress CIDR blocks to use specific IP addresses (laptop + EC2 instance)
- Improve terraform.tfvars.example with comprehensive documentation
- Add Keycloak OAuth2 client secret variables to root configuration
- Restrict network access to known IPs for improved security
- Separate Keycloak ALB configuration from main ALB configuration
- Deleted database.tf (PostgreSQL only for Keycloak)
- Removed all Keycloak secrets (database, admin, client)
- Removed Keycloak ALB and listeners
- Removed Keycloak ECS service (was using start-dev mode)
- Removed all Keycloak variables and outputs
- Removed Keycloak CloudWatch alarms
- Removed RDS alarms for Keycloak database
- Total: 13 files modified, 1 file deleted
Files Modified:
- terraform/aws-ecs/main.tf
- terraform/aws-ecs/modules/mcp-gateway/ecs-services.tf
- terraform/aws-ecs/modules/mcp-gateway/iam.tf
- terraform/aws-ecs/modules/mcp-gateway/locals.tf
- terraform/aws-ecs/modules/mcp-gateway/monitoring.tf
- terraform/aws-ecs/modules/mcp-gateway/networking.tf
- terraform/aws-ecs/modules/mcp-gateway/outputs.tf
- terraform/aws-ecs/modules/mcp-gateway/secrets.tf
- terraform/aws-ecs/modules/mcp-gateway/variables.tf
- terraform/aws-ecs/outputs.tf
- terraform/aws-ecs/variables.tf
Files Deleted:
- terraform/aws-ecs/modules/mcp-gateway/database.tf
Verification Tests Passed:
✓ No Keycloak references in .tf or .tfvars files
✓ Terraform validate succeeds
✓ Terraform plan shows destructions only
✓ database.tf successfully deleted
Ref: docs/keycloak-integration/keycloak-removal-checklist.md
- Uses 'start --optimized' instead of 'start-dev'
- Pre-builds Keycloak for production
- Copied from working aws-ecs-keycloak repository
Added files:
- keycloak-database.tf: Aurora MySQL Serverless v2 + RDS Proxy
- keycloak-ecs.tf: ECS service with production mode
- keycloak-security-groups.tf: Security groups integrated with VPC
- keycloak-alb.tf: Application Load Balancer with HTTPS
- keycloak-dns.tf: Route53 zone and ACM certificate
- keycloak-ecr.tf: ECR repository for Docker images
- locals.tf: Common tags for all resources
Key integrations:
- Uses EXISTING VPC (module.vpc.vpc_id)
- Production-ready configuration
- Auto-validated SSL certificate
- RDS Proxy for connection pooling
- CloudWatch logging and monitoring
- ECS auto-scaling based on CPU/memory
- Deployed to us-west-2
- Create registry-dns.tf with Route 53 DNS configuration
- Add A record for registry.mycorp.click pointing to main ALB
- Generate ACM certificate for registry.mycorp.click
- Auto-validate certificate with DNS challenge
- Add registry_url, registry_certificate_arn outputs
- Enables HTTPS support for main registry service
Phases completed:
- Phase 0: Removed all 284 Keycloak references from broken implementation
- Phase 1-2: Copied production Dockerfile (uses start --optimized)
- Phase 3: Added 6 Keycloak Terraform files (968 lines)
- Phase 4: Built and pushed Docker image to ECR
- Phase 5: Deployed Keycloak infrastructure with Terraform
- Phase 6: Configured DNS and SSL certificates
New infrastructure deployed:
- Keycloak: kc.mycorp.click (Aurora MySQL, ECS, ALB, ACM)
- Registry: registry.mycorp.click (DNS, ACM cert, linked to main ALB)
- VPC: 10.0.0.0/16 with 3 AZs
- ECS Services: Keycloak, Auth Server, Registry (all running/starting)
- Security: Proper security groups, IAM roles, secrets management
Files created:
- keycloak-database.tf: Aurora MySQL Serverless v2 + RDS Proxy
- keycloak-ecs.tf: ECS service with auto-scaling
- keycloak-security-groups.tf: Security group rules
- keycloak-alb.tf: Application Load Balancer
- keycloak-dns.tf: Route53 + ACM certificate
- keycloak-ecr.tf: ECR repository
- registry-dns.tf: Registry DNS + certificate
- docker/keycloak/Dockerfile: Production-ready image
Configuration:
- Admin: admin (password in SSM Parameter Store)
- Database: Serverless Aurora MySQL 0.5-2 ACU
- Region: us-west-2
- Auto-scaling: Enabled for all services
- Monitoring: CloudWatch alarms configured
Deployment verified:
- DNS resolution: Both domains resolve to ALB IPs
- Keycloak health: 503 (service starting up)
- ECS services: Auth Server running (2/2), Registry pending (0/2)
- Security groups and certificates properly configured
Next steps:
- Wait for all services to reach running state
- Configure Keycloak realms and clients
- Link main ALB HTTPS to registry certificate
- Run end-to-end authentication tests
- Created build-and-push-keycloak.sh: Automated ECR push script
  - Build Docker image from Dockerfile
  - Auto-login to AWS ECR
  - Tag and push to ECR with configurable tags
  - Verify push success
  - Support for custom regions/profiles
  - Color-coded output and error handling
- Updated Makefile with Keycloak targets
  - make build-keycloak: Build locally
  - make build-and-push-keycloak: Build and push to ECR
  - make deploy-keycloak: Deploy to ECS
  - make update-keycloak: Full workflow (build+push+deploy)
  - Support for AWS_REGION, AWS_PROFILE, IMAGE_TAG variables
- Added scripts/README.md documentation
  - Complete usage examples
  - Troubleshooting guide
  - Option reference
  - Prerequisites and features
Replaces manual build/push steps with automated, repeatable process. Simplifies future Keycloak image updates and deployments.
- Created save-terraform-outputs.sh: Automated outputs export script
  - Exports all terraform outputs to text or JSON format
  - Creates formatted, readable output file
  - Automatic backup of previous outputs
  - Shows key infrastructure URLs and details
  - Color-coded logging and progress tracking
- Updated Makefile with output export targets
  - make save-outputs: Export as formatted text
  - make save-outputs-json: Export as JSON
  - Added to help documentation
- Generated initial terraform-outputs.txt
  - Documents all deployed resources
  - Contains all service URLs
  - Includes deployment summary and metadata
  - Ready for archival and documentation
This provides clear documentation of all deployed resources and makes it easy to regenerate outputs as infrastructure changes.
- Create view-cloudwatch-logs.sh: New script to view CloudWatch logs for all ECS services (keycloak, registry, auth-server) with support for live tailing, time range filtering, and pattern matching
- Simplify save-terraform-outputs.sh: Now outputs JSON-only format for better machine readability
- Move build-and-push-keycloak.sh to terraform/aws-ecs/scripts/ directory for better organization
- Move save-terraform-outputs.sh to terraform/aws-ecs/scripts/ directory
- Create comprehensive README.md for scripts directory documenting all utilities
- Update Makefile to add view-logs targets and update script paths
New make targets:
- make view-logs: View all component logs from last 30 minutes
- make view-logs-keycloak: View Keycloak logs only
- make view-logs-registry: View Registry logs only
- make view-logs-auth: View Auth Server logs only
- make view-logs-follow: Follow all logs in real-time
Features:
- CloudWatch logs script supports --minutes, --follow, --component, --filter options
- All scripts have color-coded output for easy readability
- Automated backup of previous terraform outputs
- AWS CLI integration for fetching logs
…roxy
- Changed database URL from RDS Proxy endpoint to direct RDS cluster endpoint
- Added depends_on relationship between proxy target and RDS instance
- This resolves the 'Communications link failure' error Keycloak was experiencing
- Direct connection is more reliable for Serverless v2 Aurora
- Changed output location from terraform/ to terraform/aws-ecs/scripts/
- Updated TERRAFORM_DIR to include full path from repo root
- Added OUTPUT_DIR variable pointing to script directory
- Updated documentation comments to reflect new location
- Tested and verified script works correctly
- Output file now in same directory as the script for easier access
- Added load_from_terraform_outputs() function to read from terraform-outputs.json
- Script now automatically loads ALB DNS names from saved terraform outputs
- Greatly simplifies usage - only requires 3 env vars instead of 5
- Falls back gracefully if terraform-outputs.json not found or jq not available
- Updated INIT-KEYCLOAK.md with simplified usage examples
- Prioritizes explicitly set environment variables over JSON values
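The lookup order described here (explicit environment variable first, then the saved outputs file, then a graceful fallback) can be sketched in Python as follows. This is not the shell implementation from the commit. The function names and the `KEYCLOAK_ALB_DNS` / `keycloak_alb_dns` names used in the usage note are hypothetical, and the file is assumed to have the `{"name": {"value": ...}}` shape produced by `terraform output -json`.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def load_terraform_outputs(path: Path) -> dict:
    """Return {output_name: value} from a saved outputs file, or {} on any problem."""
    try:
        raw = json.loads(path.read_text())
    except (OSError, ValueError):
        # Missing or unparsable file: fall back gracefully instead of failing.
        return {}
    if not isinstance(raw, dict):
        return {}
    return {
        name: entry["value"]
        for name, entry in raw.items()
        if isinstance(entry, dict) and "value" in entry
    }


def resolve_setting(
    env_name: str,
    output_name: str,
    outputs: Mapping[str, object],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer an explicitly set environment variable over the JSON value."""
    explicit = env.get(env_name)
    if explicit:
        return explicit
    value = outputs.get(output_name)
    return str(value) if value is not None else None
```

For example, `resolve_setting("KEYCLOAK_ALB_DNS", "keycloak_alb_dns", outputs)` returns the environment value when it is set and the saved Terraform output otherwise.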
- Script now retrieves KEYCLOAK_ADMIN_PASSWORD from SSM if not set via env var
- Uses AWS CLI to fetch /keycloak/admin_password parameter
- Falls back to environment variable if SSM unavailable
- Eliminates need to manually pass admin password when AWS credentials available
- Shows helpful error message if neither source provides password
- Greatly simplifies script usage in automated deployments
Major improvements:
- Fix all jq parsing errors in init-keycloak.sh with proper type checking
- Add token expiration handling to prevent 401 authentication errors
- Create smart JWT token management with SSM caching (get-m2m-token.sh)
- Add user and service management scripts for cloud deployment
- Update mcp_client.py to support OAUTH_TOKEN environment variable
- Add comprehensive post-deployment documentation to README
- Update .gitignore for Terraform plan and backup files
- Fix Keycloak database configuration and networking setup
Scripts:
- terraform/aws-ecs/scripts/get-m2m-token.sh: Smart token retrieval with SSM cache
- terraform/aws-ecs/scripts/user_mgmt.sh: M2M and human user management
- terraform/aws-ecs/scripts/service_mgmt.sh: MCP server registration
- terraform/aws-ecs/scripts/init-keycloak.sh: Enhanced with robust error handling
All scripts now support environment variable overrides with automatic fallback to terraform-outputs.json and AWS SSM Parameter Store.
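The token expiration handling mentioned above could work roughly as in the Python sketch below, which decides whether a cached JWT should be refreshed before use. It is an illustration only and not the logic of get-m2m-token.sh. The function names and the 60-second safety margin are assumptions, and the token is decoded without signature verification because only the `exp` claim is inspected.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> Optional[int]:
    """Return the exp claim (Unix seconds) of a JWT, or None if it cannot be read."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except (ValueError, TypeError):
        return None
    exp = claims.get("exp") if isinstance(claims, dict) else None
    return int(exp) if isinstance(exp, (int, float)) else None


def needs_refresh(token: str, now: Optional[float] = None, skew_seconds: int = 60) -> bool:
    """True when the cached token is unreadable, expired, or about to expire."""
    exp = jwt_expiry(token)
    if exp is None:
        return True
    current = time.time() if now is None else now
    return current >= exp - skew_seconds
```

A caller would fetch a new token whenever `needs_refresh(cached_token)` is true, which avoids sending an expired token and receiving a 401.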
Major changes:
- Add Python registry management client with Pydantic models
- Fix OAuth login by adding AUTH_SERVER_URL environment variable
- Add EFS volume mount for auth server scopes.yml configuration
- Fix ALB security group to allow public access (0.0.0.0/0)
- Fix SSM put-parameter output contamination in get-m2m-token.sh
- Add service account name fallback (service-account- prefix)
- Add registry URL configuration priority (env > terraform-outputs.json > default)
Python Client:
- registry_client.py: Core API client with Pydantic models
- registry_management.py: CLI wrapper for registry operations
- Token retrieval via get-m2m-token.sh subprocess
- Token redaction in logs (show only first 8 characters)
- Debug logging support with --debug flag
Infrastructure:
- Add auth_config EFS access point for runtime configuration
- Add copy-scopes-to-efs.sh script for initial setup
- Update scopes.yml loading to check SCOPES_CONFIG_PATH env var
- Change default ingress_cidr_blocks from specific IPs to 0.0.0.0/0
Token Management:
- Fix SSM parameter double JSON parsing
- Redirect aws ssm put-parameter output to prevent contamination
- Support both original and service-account- prefixed client names
- Check both SSM parameter locations for tokens
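The token redaction rule (show only the first 8 characters in logs) might look like the following Python sketch. The function name and the handling of empty or very short tokens are assumptions, not taken from registry_client.py.

```python
def redact_token(token: str, visible: int = 8) -> str:
    """Return a log-safe form of a token that exposes only its first characters."""
    if not token:
        return "<empty>"
    if len(token) <= visible:
        # Assumption: a token this short is masked entirely so that
        # nothing close to the full value reaches the logs.
        return "*" * len(token)
    return token[:visible] + "..."
```

A log call would then pass `redact_token(token)` rather than the raw token value.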
Added Pydantic models and methods for the A2A (Agent-to-Agent) agent management API based on docs/api-specs/a2a-agent-management.yaml.
New Pydantic Models:
- AgentProvider, AgentVisibility, SecuritySchemeType enums
- AgentRegistration: Agent registration request
- AgentCard: Agent summary view
- AgentDetail: Detailed agent information
- Skill, SkillDetail: Agent capability models
- AgentListResponse, AgentToggleResponse: Operation responses
- SkillDiscoveryRequest, AgentDiscoveryResponse: Skill-based discovery
- SemanticDiscoveredAgent, AgentSemanticDiscoveryResponse: Semantic search
New RegistryClient Methods:
- register_agent(): Register new A2A agent
- list_agents(): List agents with filtering (query, enabled_only, visibility)
- get_agent(): Get detailed agent information
- update_agent(): Update existing agent
- delete_agent(): Remove agent from registry
- toggle_agent(): Enable/disable agent
- discover_agents_by_skills(): Find agents by required skills
- discover_agents_semantic(): NLP semantic search using FAISS
All methods include proper type hints, docstrings, and error handling for HTTP status codes (404, 403, 409, 422, 400, 500).
Extended the CLI to support all A2A (Agent-to-Agent) agent management
operations from the registry client.
New CLI Commands:
- agent-register: Register new agent from JSON config
- agent-list: List agents with filtering (query, enabled_only, visibility)
- agent-get: Get detailed agent information
- agent-update: Update existing agent from JSON config
- agent-delete: Delete agent with confirmation prompt
- agent-toggle: Enable/disable agent
- agent-discover: Discover agents by required skills (comma-separated)
- agent-search: Semantic search using natural language queries
Command Features:
- JSON config file support for registration/updates
- Automatic enum conversion (provider, visibility)
- Skill object construction from JSON
- Pretty-printed JSON output for structured data
- Confirmation prompts for destructive operations
- Support for filtering and search parameters
- Max results limits for discovery operations
Example Usage:
# Register agent
uv run python registry_management.py agent-register --config agent.json
# List enabled agents
uv run python registry_management.py agent-list --enabled-only
# Discover by skills
uv run python registry_management.py agent-discover \
--skills code_analysis,bug_detection --max-results 5
# Semantic search
uv run python registry_management.py agent-search \
--query "agents that analyze code"
Updated module docstring with comprehensive examples for all agent
management operations.
Enhanced cmd_agent_register and cmd_agent_update to handle real-world
JSON configurations more gracefully.
Improvements:
- Support both 'input_schema' and 'parameters' field names for skills
- Support both 'name' and 'id' field names for skill names
- Map provider values flexibly ('Example Corp' -> 'custom')
- Map security scheme types (OpenAPI 'http' -> 'bearer')
- Filter out extra fields not in AgentRegistration model
- Better error logging with exc_info for debugging
- Graceful handling of unknown enum values with warnings
Provider Mapping:
- Accepts 'anthropic', 'custom', 'other' (exact matches)
- Maps 'Example Corp', 'example' to 'custom'
- Unknown values default to 'custom' with warning
Security Scheme Mapping:
- Maps OpenAPI 'http' type to 'bearer'
- Supports 'bearer', 'apikey'/'api_key', 'oauth2'
- Unknown types default to 'bearer'
This allows the CLI to work with various JSON formats including
OpenAPI-style agent specifications without manual conversion.
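The two mappings above can be sketched as small lookup functions. This is an illustration, not the code in registry_management.py. The function names are hypothetical, matching is kept exact as the commit message states, and treating 'apikey' as the canonical spelling for both 'apikey' and 'api_key' is an assumption.

```python
import logging

logger = logging.getLogger(__name__)

KNOWN_PROVIDERS = {"anthropic", "custom", "other"}
PROVIDER_ALIASES = {"Example Corp": "custom", "example": "custom"}

# Assumption: 'apikey' is the canonical name; 'api_key' is accepted as an alias.
SCHEME_TYPES = {
    "http": "bearer",  # OpenAPI 'http' type maps to 'bearer'
    "bearer": "bearer",
    "apikey": "apikey",
    "api_key": "apikey",
    "oauth2": "oauth2",
}


def map_provider(value: str) -> str:
    """Map a provider string from a JSON config to a supported provider value."""
    if value in KNOWN_PROVIDERS:
        return value
    if value in PROVIDER_ALIASES:
        return PROVIDER_ALIASES[value]
    logger.warning("Unknown provider %r, defaulting to 'custom'", value)
    return "custom"


def map_security_scheme(value: str) -> str:
    """Map a security scheme type, defaulting to 'bearer' for unknown types."""
    return SCHEME_TYPES.get(value, "bearer")
```

With these helpers, `map_provider("Example Corp")` gives `"custom"` and `map_security_scheme("http")` gives `"bearer"`.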
This commit addresses multiple issues with agent registration, Keycloak configuration, and OAuth2 authentication:
1. Synchronized Pydantic models between client and server
  - Added missing fields to SecurityScheme model (scheme, bearer_format, in_, name, flows, openid_connect_url)
  - Updated Skill model: made id required, renamed input_schema to parameters, added tags
  - Fixed AgentListItem/AgentListResponse: moved total_count to response level, added missing fields
  - Fixed security_schemes transformation to preserve all fields from original JSON
2. Enhanced init-keycloak.sh with LOB groups and service accounts
  - Added registry-admins, registry-users-lob1, registry-users-lob2 groups
  - Created service account clients: registry-admin-bot, lob1-bot, lob2-bot
  - Created LOB users: lob1-user, lob2-user
  - Added proper group assignments for all entities
3. Fixed OAuth2 client secret persistence
  - Stored mcp-gateway-web client secret in AWS Secrets Manager
  - Added SSM Parameter Store backup for secret persistence
  - Ensures auth server survives terraform redeployments
These changes fix validation errors during agent registration and prevent OAuth2 authentication failures after ECS task redeployments.
Some MCP servers (currenttime, realserverfaketools) don't allow multiple concurrent sessions on the same streamable-http endpoint. This causes tool fetches to fail with 400 Bad Request when attempted immediately after health checks. Added 0.5 second delay before tool fetch to ensure the health check session is properly closed before attempting to establish a new session for tool retrieval. This fixes the 0 tools issue for currenttime and realserverfaketools servers.
The remove endpoint expects form field named 'path' but the client was sending 'service_path'. This caused FastAPI validation to fail with 422 Unprocessable Entity because the required 'path' parameter was missing. Changed data dict key from 'service_path' to 'path' to match the endpoint's Form() parameter declaration.
Enhanced Makefile with new targets for local A2A agent development:
- compose-up-agents: Start agents with docker-compose locally
- compose-down-agents: Stop local agents
- compose-logs-agents: Follow agent logs in real-time
- build-agents: Build both agent images locally
- push-agents: Push both agent images to ECR
These targets simplify the workflow for developing, testing, and deploying the Flight Booking Agent and Travel Assistant Agent A2A services.
The agent Dockerfiles were copying '.' which was copying the entire
agents/a2a directory tree when the build context is agents/a2a. This caused
agent.py to not be found at the correct location.
Updated COPY commands to explicitly reference the correct source paths:
- flight-booking-agent: COPY src/flight-booking-agent/ ./
- travel-assistant-agent: COPY src/travel-assistant-agent/ ./
This ensures agent.py and all other agent code is correctly copied into the
/app directory inside the container, fixing the "Failed to spawn: agent.py"
errors in the logs.
Root cause: The Dockerfiles are located in src/{agent}/ but the build context
in build-config.yaml is agents/a2a, so relative paths needed to account for this.
The build-images.sh script was using context 'agents/a2a' which caused agent.py to be placed at /app/src/flight-booking-agent/agent.py instead of /app/agent.py, leading to "Failed to spawn: agent.py" errors in ECS.
Changes:
- Update build-config.yaml to use agent-specific contexts:
  * flight_booking_agent: agents/a2a/src/flight-booking-agent
  * travel_assistant_agent: agents/a2a/src/travel-assistant-agent
- Update build-images.sh setup_a2a_agent() to copy dependencies from agents/a2a level to each agent's .tmp/ directory
- Simplify Dockerfile comments to remove confusing dual-context notes
This aligns build-images.sh behavior with docker-compose.local.yml, which also uses agent-specific contexts. Both systems now correctly place agent.py at /app/agent.py within the container.
Verified all required files exist and build system is consistent.
…ectory
- Created /api directory with standalone registry management scripts
- Added Anthropic Registry API v0.1 client methods:
  - anthropic_list_servers() - List all servers
  - anthropic_list_server_versions() - List versions for a server
  - anthropic_get_server_version() - Get server details
- Integrated Anthropic API commands into registry_management.py:
  - anthropic-list: List all servers with optional --raw JSON output
  - anthropic-versions: List versions for a specific server
  - anthropic-get: Get detailed server information
- Made REGISTRY_URL environment variable mandatory in /api scripts
- Removed terraform-outputs.json dependency from /api scripts
- Added registry URL cascading lookup to anthropic_registry_example.py
- Updated both terraform/aws-ecs/scripts versions to support Anthropic API
Generated with Claude Code
Co-Authored-By: Claude <[email protected]>
- Updated default efs_throughput_mode from 'provisioned' to 'bursting'
- Added comment explaining bursting mode is FREE and recommended
- Bursting mode provides up to 100 MiB/s which is sufficient for registry operations
- Provisioned mode costs $6/MiB/s-month and should only be used when proven necessary
This change will significantly reduce AWS costs for new deployments.
Generated with Claude Code
Co-Authored-By: Claude <[email protected]>
- Unified container builds: Update Makefile to use centralized build-config.yaml
- Remove redundant build-and-push scripts (4 scripts superseded by unified build system)
- Relocate client tools: Move registry_client.py and registry_management.py to /api
- Improve backup management: Store terraform-outputs.json backups in .terraform/
- Rename setup-keycloak-client.sh to rotate-keycloak-web-client-secret.sh for clarity
- Remove obsolete deploy-currenttime-ecs.sh (terraform handles deployments)
- Remove keycloak-integration docs folder (outdated planning documents)
Changes:
- Makefile: Keycloak build targets now use unified build system
- 8 scripts removed from terraform/aws-ecs/scripts/
- save-terraform-outputs.sh: Backups now stored in gitignored .terraform/
- rotate-keycloak-web-client-secret.sh: Improved documentation and naming
…in support
- Fixed all 5 group management API endpoints (add/remove-from-groups, create/delete-group, list-groups)
  - Added missing Request parameter to all endpoints
  - Fixed parameter name mismatches (server_path→server_name, groups→group_names)
  - Added missing optional parameters (description, create_in_keycloak, delete_from_keycloak, force)
- Removed hardcoded us-west-2 region references (12 instances)
  - init-keycloak.sh: 5 fixes
  - rotate-keycloak-web-client-secret.sh: 1 fix
  - user_mgmt.sh: 1 fix
  - main.tf: Changed registry_image_uri to use variable
- Deleted deprecated copy-scopes-to-efs.sh
- Implemented regional domain support
  - Added use_regional_domains flag with base_domain variable
  - Domains now auto-generate as kc.{region}.mycorp.click and registry.{region}.mycorp.click
  - Updated all terraform files to use local.keycloak_domain and local.root_domain
  - Supports both regional and static domain modes
- Fixed metrics-service Dockerfile COPY paths to match build context
- Updated terraform.tfvars.example with all variables and comprehensive documentation
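To make the regional domain naming concrete, the following Python sketch mirrors the scheme described in this commit. The real logic lives in Terraform locals; the function name `service_domains` and the dictionary keys are illustrative, and the static-mode result (kc.mycorp.click, registry.mycorp.click) is assumed from the domains used in the earlier commits.

```python
def service_domains(base_domain: str, region: str, use_regional_domains: bool = True) -> dict:
    """Compute Keycloak and registry hostnames for regional or static mode."""
    # Regional mode inserts the AWS region between the service label and the
    # base domain; static mode uses the base domain directly (assumption).
    root = f"{region}.{base_domain}" if use_regional_domains else base_domain
    return {
        "root_domain": root,
        "keycloak_domain": f"kc.{root}",
        "registry_domain": f"registry.{root}",
    }
```

For `base_domain="mycorp.click"` and `region="us-east-1"`, regional mode gives kc.us-east-1.mycorp.click and registry.us-east-1.mycorp.click.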
- Production-grade documentation (1,703 lines) for AWS ECS infrastructure
- Complete architecture explanation with diagram
- Step-by-step regional deployment guide
- Prerequisites with tool versions and IAM policies
- Post-deployment checklist (7 steps)
- Container build and deployment workflows
- Complete developer workflow from code to deployment
- Troubleshooting guide with real error scenarios
- Cost optimization strategies ($110-250/month breakdown)
- Security best practices and backup procedures
- Quick reference cheat sheet for common commands
- All links verified, no hardcoded credentials
- Ready for GitHub with professional formatting
…ariable
Problem:
The build-images.sh and generate-image-manifest.sh scripts were parsing the ECR registry URL directly from build-config.yaml which had a hardcoded us-west-2 region. This caused ECR authentication failures when deploying to other regions even when AWS_REGION environment variable was set.
Error was:
Error response from daemon: login attempt to https://605134468121.dkr.ecr.us-west-2.amazonaws.com/v2/ failed with status: 400 Bad Request
Root cause:
Line 50 in build-images.sh: ECR_REGISTRY=$(grep 'ecr_registry:' "$CONFIG_FILE" ...)
This parsed the hardcoded us-west-2 URL from build-config.yaml regardless of the AWS_REGION environment variable.
Solution:
1. Updated scripts/build-images.sh:
  - Construct ECR_REGISTRY dynamically using AWS_REGION env var
  - Changed from parsing config to: ECR_REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
  - Defaults to us-west-2 if AWS_REGION not set
2. Updated scripts/generate-image-manifest.sh:
  - Same dynamic ECR registry construction
  - Now respects AWS_REGION environment variable
  - Outputs region being used for transparency
3. Updated build-config.yaml documentation:
  - Added comments explaining that region values are defaults
  - Documented that AWS_REGION env var overrides config
  - Included usage example for multi-region deployment
Usage:
# Deploy to us-east-1
export AWS_REGION=us-east-1
make build-push
# Deploy to eu-west-1
export AWS_REGION=eu-west-1
make build-push IMAGE=registry
# Use default (us-west-2)
make build-push
This allows the same codebase to deploy to any AWS region without manually editing configuration files.
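The dynamic registry construction can be illustrated in Python as follows. This is a sketch of the same rule, not the shell code in build-images.sh; the function name `ecr_registry` and the injectable `env` mapping are assumptions made for testability.

```python
import os
from typing import Mapping


def ecr_registry(
    account_id: str,
    env: Mapping[str, str] = os.environ,
    default_region: str = "us-west-2",
) -> str:
    """Build the ECR registry hostname from the account ID and AWS_REGION."""
    # AWS_REGION wins when set; otherwise fall back to the documented default.
    region = env.get("AWS_REGION") or default_region
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"
```

With `AWS_REGION=us-east-1` and the placeholder account `123456789012`, this yields `123456789012.dkr.ecr.us-east-1.amazonaws.com`.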
…fig.yaml
Changed account ID from real value (605134468121) to placeholder (123456789012) to ensure the build system fails fast if AWS credentials are not properly configured, rather than accidentally working only for the original account.
This prevents the anti-pattern where:
- Works in original account without proper env var setup
- Silently fails or behaves unexpectedly in different accounts
- Masks configuration issues during multi-account deployments
The build scripts dynamically retrieve the actual account ID from:
aws sts get-caller-identity --query Account --output text
Making this change ensures consistent behavior across all AWS accounts and forces explicit credential configuration.
Updated terraform/aws-ecs/README.md to explicitly document that use_regional_domains = true is the default configuration setting in variables.tf.
Changes:
- Section title: 'Regional Domains (Recommended - DEFAULT)'
- Added '(the default)' in explanation text
- Added inline comments in terraform.tfvars examples stating it's the default
- Changed 'RECOMMENDED' to 'DEFAULT' in critical parameters section
- Clarified that static domains are an 'override' of the default
This makes it clear to users that they get regional domains automatically (e.g., kc.us-east-1.mycorp.click) unless they explicitly opt out by setting use_regional_domains = false.
Benefits:
- Users understand they don't need to set use_regional_domains = true
- Makes multi-region deployment pattern clear as default behavior
- Reduces confusion about which domain mode is being used
…cit instructions
Added comprehensive 'MANDATORY: Edit Required Parameters' section to clearly guide users through required terraform.tfvars configuration before deployment.
Key improvements:
1. **Required Parameters Section** - Numbered list of 5 mandatory changes:
  - AWS Region configuration
  - Domain configuration (with first-time user guidance)
  - Container image URIs (all 7 images)
  - Network access control (ingress_cidr_blocks)
  - Keycloak credentials
2. **Domain Configuration for First-Time Users**:
  - Explicitly states use_regional_domains=true is already the default
  - No need to set it explicitly
  - Only need to change base_domain to their Route53 domain
  - Clear examples: kc.us-east-1.mycorp.click, registry.us-east-1.mycorp.click
  - Requirement: Must have domain registered with Route53
3. **Network Access Control - MANDATORY**:
  - Emphasized that ingress_cidr_blocks MUST be updated
  - Explained why: ALB security groups need it for access control
  - Without updating, services won't be accessible
  - Added helper: curl -s ifconfig.me to find IP
  - Explained /32 for single IP, /24 for ranges
4. **Quick Configuration Helper Script**:
  - Automatically retrieves AWS_ACCOUNT_ID, MY_IP, AWS_REGION
  - Displays formatted output for easy copy-paste
5. **Clear Structure**:
  - Each parameter has explanation of what and why
  - Code examples with inline comments
  - Consistent formatting throughout
This addresses common deployment confusion where users:
- Don't realize they need to edit terraform.tfvars
- Don't understand use_regional_domains is already true
- Forget to update ingress_cidr_blocks and can't access services
- Miss updating all 7 container image URIs
- Use weak default passwords in production
The mandatory section appears before the detailed deployment steps, ensuring users configure correctly before running terraform apply.
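The /32 versus /24 distinction for ingress_cidr_blocks can be checked with Python's standard ipaddress module. The helper below is a hypothetical illustration (IPv4 only) and is not part of the repository's helper script; the address 203.0.113.7 is a documentation-range example.

```python
import ipaddress


def to_ingress_cidr(ip: str, prefix: int = 32) -> str:
    """Turn an IPv4 address into a CIDR block suitable for an allowlist entry."""
    # strict=False clears host bits, so a /24 built from one host address
    # is normalized to the enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network)
```

`to_ingress_cidr("203.0.113.7")` produces a single-host block, while `to_ingress_cidr("203.0.113.7", 24)` produces the surrounding 256-address range.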
…guration
Updated Quick Start section step 3 to explicitly warn users that they must edit required parameters in terraform.tfvars or the installation will fail. Added clear reference to the "MANDATORY: Edit Required Parameters" section.
Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Added ingress_cidr_blocks variable declaration to root variables.tf and passed it to the mcp_gateway module in main.tf. This fixes the Terraform warning about undeclared variable.
Variable allows users to specify CIDR blocks for ALB security group access control via terraform.tfvars.
Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…r regional domains
Fixed three critical issues preventing fresh deployment with regional domains:
1. Route53 Hosted Zone Lookup:
- Added local.hosted_zone_domain to use base_domain (mycorp.click) for
hosted zone lookups when use_regional_domains=true
- Previously used local.root_domain (us-east-1.mycorp.click) which
doesn't match the actual hosted zone name
2. Keycloak Client Secrets:
- Changed from data sources to managed resources
- Created with placeholder values during terraform apply
- Added lifecycle.ignore_changes to allow init-keycloak.sh to update them
- Prevents "couldn't find resource" errors on fresh deployments
3. Updated all ARN references:
- Changed data.aws_secretsmanager_secret to aws_secretsmanager_secret
- Updated in iam.tf and ecs-services.tf
These fixes enable successful terraform apply in new regions using regional
domain configuration.
Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
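The hosted zone fix in item 1 comes down to which domain string is used for which purpose. The sketch below mirrors that logic in Python purely for illustration; the real change lives in Terraform locals, and the function names here are hypothetical. It covers only the use_regional_domains=true case described in the commit.

```python
def regional_root_domain(base_domain: str, region: str) -> str:
    """Domain under which regional service hostnames are created."""
    return f"{region}.{base_domain}"


def service_hostname(service: str, base_domain: str, region: str) -> str:
    """Fully qualified hostname for a service in a regional deployment."""
    return f"{service}.{regional_root_domain(base_domain, region)}"


def hosted_zone_lookup_name(base_domain: str) -> str:
    """Name used to look up the Route53 hosted zone.

    The zone is registered under the base domain, so the lookup must not
    include the region prefix.
    """
    return base_domain


base = "mycorp.click"
region = "us-east-1"

keycloak_host = service_hostname("kc", base, region)
zone_name = hosted_zone_lookup_name(base)

# The record lives inside the zone, but the zone name itself has no region part.
print(keycloak_host, "is a record in zone", zone_name)
```

Looking up the zone by the regional root domain (us-east-1.mycorp.click) fails because no hosted zone with that name exists, which is the failure the commit describes.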
Document the required two-stage Terraform deployment process for first-time deployments due to SSL certificate ARN dependencies.
Changes:
- Add detailed explanation of why two stages are needed
- Document Stage 1: Create and validate SSL certificates with -target flags
- Document Stage 2: Deploy remaining infrastructure
- Clarify that subsequent deployments do NOT require two stages
- Add timing expectations (5-10 minutes for certificate validation)
This resolves the ALB listener for_each dependency issue where certificate ARNs are not known until after ACM creates and validates certificates.
- Add flight-booking and travel-assistant A2A agent ECS services
- Fix registry_client.py to use JSON instead of form-encoded data
- Update get-m2m-token.sh to support regional Keycloak URLs
- Add CLI example files for agents and MCP servers
- Remove hardcoded Keycloak instance from terraform config
- Update build-config.yaml with agent container definitions
- Add Keycloak first login screenshot to documentation
Technical changes:
- Agents run on port 9000 with /ping health checks
- Registry client now properly serializes request data as JSON
- Keycloak configuration supports multi-region deployment
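The difference between form-encoded and JSON request bodies can be shown with the standard library alone. This is a sketch under stated assumptions, not the actual registry_client.py code: the URL, the payload fields, and both helper functions are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_form_request(url: str, payload: dict) -> urllib.request.Request:
    """Form-encoded POST: nested values are flattened to their string form."""
    body = urllib.parse.urlencode(payload).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def build_json_request(url: str, payload: dict) -> urllib.request.Request:
    """JSON POST: lists and nested objects survive serialization intact."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Assumptions: placeholder endpoint and payload, for illustration only.
url = "https://registry.example.com/api/agents/register"
payload = {"name": "flight-booking", "tags": ["travel", "a2a"]}

form_req = build_form_request(url, payload)
json_req = build_json_request(url, payload)

# The JSON body round-trips to the original structure; the form body does not.
print(form_req.data)
print(json.loads(json_req.data))
```

The requests are only constructed here, never sent, so the snippet runs without network access. An API that parses a JSON body would reject or misread the form-encoded variant, which is consistent with the fix described above.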
- Add time estimates to deployment stages (~10 min for Stage 2, ~30-40 min total)
- Update container image sizes to match actual ECR data (~9.8GB total across 7 images)
- Remove non-existent container images from documentation
- Make INITIAL_ADMIN_PASSWORD mandatory in init-keycloak.sh (no defaults)
- Add clear documentation of password distinction (realm admin vs master admin)
- Standardize on mycorp.click as example domain with clear replacement note
- Update all ECR image URIs to use YOUR_AWS_REGION placeholder
- Add comprehensive network access control options documentation
Security improvements:
- Require explicit INITIAL_ADMIN_PASSWORD environment variable
- Remove insecure default passwords
- Add validation checks with helpful error messages
The Docker Compose configuration uses the root directory as the build context, so all COPY commands in the Dockerfile need to be relative to the root, not the metrics-service directory.
Updated COPY commands to prefix paths with metrics-service/:
- pyproject.toml -> metrics-service/pyproject.toml
- app/ -> metrics-service/app/
- create_api_key.py -> metrics-service/create_api_key.py
This resolves the build failure: "not found" errors during docker build.
- Removed terraform/aws-ecs/scripts/terraform-outputs.json (contains deployment-specific outputs)
- Removed terraform/aws-ecs/terraform.tfvars.ue1 (contains region-specific configuration)
- Added patterns to .gitignore to prevent future commits of:
  - terraform-outputs.json files (environment-specific)
  - terraform.tfvars.* files (region-specific configs like .ue1, .uw2, etc)
These files contain environment and deployment-specific data that should not be version controlled.
Updated AsorFederationConfig and AnthropicFederationConfig to have sensible defaults that allow the application to start even when the federation.json config file is missing.
Changes:
- Set AsorFederationConfig.enabled default to False (was True)
- Set AsorFederationConfig.endpoint default to "" (was required field)
- Set AnthropicFederationConfig.enabled default to False (was True)
This fixes the startup failure: "ValidationError: 1 validation error for AsorFederationConfig endpoint: Field required"
Federation features are now opt-in rather than opt-out, which makes more sense for deployments that don't need federation.
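The effect of these defaults can be sketched with plain dataclasses. The error message above suggests the project's models use a validation library; stdlib dataclasses are used here only as a stand-in so the example is self-contained. The FederationConfig wrapper, the load_federation_config helper, and the "asor"/"anthropic" keys in federation.json are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class AsorFederationConfig:
    enabled: bool = False  # opt-in: previously defaulted to True
    endpoint: str = ""     # previously a required field with no default


@dataclass
class AnthropicFederationConfig:
    enabled: bool = False  # opt-in: previously defaulted to True


@dataclass
class FederationConfig:
    # Hypothetical container; the real layout may differ.
    asor: AsorFederationConfig = field(default_factory=AsorFederationConfig)
    anthropic: AnthropicFederationConfig = field(
        default_factory=AnthropicFederationConfig
    )


def load_federation_config(path: Path) -> FederationConfig:
    """Load federation settings, falling back to defaults if the file is absent."""
    if not path.exists():
        # With every field defaulted, a missing file no longer aborts startup.
        return FederationConfig()
    raw = json.loads(path.read_text())
    return FederationConfig(
        asor=AsorFederationConfig(**raw.get("asor", {})),
        anthropic=AnthropicFederationConfig(**raw.get("anthropic", {})),
    )


config = load_federation_config(Path("does-not-exist/federation.json"))
print(config)
```

Because every field carries a default, constructing the config with no input succeeds and leaves federation disabled until a deployment explicitly enables it.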
This was referenced Nov 24, 2025
feat: Production AWS ECS deployment with improved documentation and security
Summary
This PR implements a complete production-ready AWS ECS Fargate deployment for MCP Gateway Registry with comprehensive documentation improvements and enhanced security features.
Major Features
☁️ AWS ECS Production Deployment
📚 Documentation Improvements
🔒 Security Enhancements
🛠️ Build System Improvements
build-config.yaml as single source of truth
🤖 A2A Agent Deployment
Files Changed
New Files (48)
AWS ECS Terraform Infrastructure:
- terraform/aws-ecs/ - Complete production deployment configuration
  - main.tf, ecs.tf, vpc.tf, variables.tf, outputs.tf
  - keycloak-*.tf (ALB, database, DNS, ECR, ECS, security groups)
  - modules/mcp-gateway/ with networking, storage, monitoring, IAM
  - terraform.tfvars.example
Scripts and Automation:
- terraform/aws-ecs/scripts/ - Management and initialization scripts
  - init-keycloak.sh - Keycloak realm and user setup
  - service_mgmt.sh, user_mgmt.sh - Operations tooling
  - view-cloudwatch-logs.sh - Log monitoring
- scripts/build-images.sh - Unified container build system
- scripts/generate-image-manifest.sh - Image size tracking
API and Examples:
- api/ - Standalone API client and management tools
- cli/examples/ - Additional MCP server and agent examples
Documentation:
- docs/api-specs/ - OpenAPI specifications for A2A, auth, and server management
Modified Files (29)
Core Improvements:
- README.md - Updated with ECS deployment information and roadmap
- Makefile - Consolidated build targets and A2A agent helpers
- build-config.yaml - Region-agnostic container image configuration
- .gitignore - Added terraform state and local configuration exclusions
Application Updates:
- registry/ - Agent and server route improvements, auth enhancements
- auth_server/ - OAuth provider configuration updates
- docker/ - Updated nginx configurations and new Dockerfile for scopes-init
Agent Dockerfiles:
- agents/a2a/src/*/Dockerfile - Fixed build context paths
Breaking Changes
- INITIAL_ADMIN_PASSWORD is now required by init-keycloak.sh (no default fallback)
- ECR image URIs use YOUR_AWS_REGION and YOUR_ACCOUNT_ID placeholders in terraform.tfvars.example
Migration Guide
For existing deployments:
- Set the INITIAL_ADMIN_PASSWORD environment variable before running init-keycloak.sh
Testing
Infrastructure Validation
Container Images
Security
Performance
Deployment Time Estimates
Documentation Updates
- terraform/aws-ecs/README.md - Comprehensive deployment guide (1,800+ lines)
- README.md - Updated main README with ECS deployment information
Closes
Additional Notes
Checklist