Datafold AWS module

This repository provisions infrastructure resources on AWS for deploying Datafold using the datafold-operator.

About this module

⚠️ Important: This module is now optional. If you already have EKS infrastructure in place, you can configure the required resources independently. This module is primarily intended for customers who need to set up the complete infrastructure stack for EKS deployment.

The module provisions AWS infrastructure resources that are required for Datafold deployment. Application configuration is now managed through the datafoldapplication custom resource on the cluster using the datafold-operator, rather than through Terraform application directories.

Breaking Changes

Load Balancer Deployment (Default Changed)

Breaking Change: The load balancer is no longer deployed by default; the default has changed to deploy_lb = false.

  • Previous behavior: Load balancer was deployed by default
  • New behavior: Load balancer deployment is disabled by default
  • Action required: If you need a load balancer, explicitly set deploy_lb = true in your configuration so that an existing Terraform-managed load balancer is not destroyed. If it is removed, you will need to redeploy it and then point your DNS at the new load balancer CNAME.
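
If you call this module directly, a minimal sketch of opting back in looks like the following (the surrounding module block and its other inputs are assumed from your existing configuration):

module "datafold" {
  # ... your existing inputs ...

  # Re-enable the Terraform-managed load balancer (disabled by default as of this release)
  deploy_lb = true
}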

Application Directory Removal

  • The "application" directory is no longer part of this repository
  • Application configuration is now managed through the datafoldapplication custom resource on the cluster

Prerequisites

  • An AWS account, preferably a new isolated one.
  • Terraform >= 1.4.6
  • A customer contract with Datafold
    • The application does not work without credentials supplied by sales
  • Access to our public helm-charts repository

The full deployment will create the following resources:

  • AWS VPC
  • AWS subnets
  • AWS S3 bucket for clickhouse backups
  • AWS Application Load Balancer (optional, disabled by default)
  • AWS ACM certificate (if load balancer is enabled)
  • Three EBS volumes for local data storage
  • AWS RDS Postgres database
  • An EKS cluster
  • Service accounts for the EKS cluster to perform actions outside of its cluster boundary:
    • Provisioning existing EBS volumes
    • Updating the load balancer target group to point to specific pods in the cluster
    • Rescaling the node group between 1 and 2 nodes

Infrastructure Dependencies: For a complete list of required infrastructure resources and detailed deployment guidance, see the Datafold Dedicated Cloud AWS Deployment Documentation.

Negative scope

  • This module will not provision DNS names in your zone.

How to use this module

  • See the example directory for a reference setup; it depends on our helm-charts repository.
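
As a rough sketch of calling the module (attribute names match the inputs table below; values are placeholders, not a tested configuration, and the git source should be pinned to a release tag):

module "datafold" {
  source = "github.com/datafold/terraform-aws-datafold"  # pin to a release tag in practice

  deployment_name = "acme-datafold"
  environment     = "production"
  provider_region = "us-east-1"

  create_ssl_cert        = true
  alb_certificate_domain = "datafold.acme.com"

  k8s_public_access_cidrs   = ["1.2.3.4/32"]
  whitelisted_ingress_cidrs = ["1.2.3.4/32"]
  whitelisted_egress_cidrs  = ["0.0.0.0/0"]

  managed_node_grp1 = {
    # node group definition, see the terraform-aws-modules/eks submodule documentation
  }
}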

Create the bucket and dynamodb table for terraform state file:

  • Use the files in bootstrap to create a terraform state bucket and a dynamodb lock table.
  • Run ./run_bootstrap.sh to create them and enter the deployment_name when prompted.
    • The deployment_name is important: it is used for the Kubernetes namespace, Datadog unified tags, and in other places.
    • Suggestion: company-datafold
  • Copy the names of that bucket and table into backend.hcl.
  • Set the target_account_profile and region where the bucket and table are stored.
  • backend.hcl only configures where the Terraform state file is stored (a sketch follows below).
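
As an illustration only (bucket, table, and profile names below are placeholders for whatever your bootstrap run produced), backend.hcl typically contains the standard S3 backend settings:

bucket         = "company-datafold-terraform-state"
key            = "infra/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "company-datafold-terraform-locks"
profile        = "target-account-profile"
encrypt        = true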

The example directory contains a single deployment example for infrastructure setup.

Setting up the infrastructure:

  • It is easiest if you have full admin access in the target project.

  • Pre-create a symmetric encryption key that is used to encrypt/decrypt secrets of this deployment.

    • Use the key alias rather than the multi-Region key (MRK) ARN, and put the alias into locals.tf.
  • Certificate Requirements (depends on load balancer deployment method):

    • If deploying load balancer from this Terraform module (deploy_lb = true): Pre-create and validate the ACM certificate in your DNS, then refer to that certificate in main.tf using its domain name (Replace "datafold.acme.com")
    • If deploying load balancer from within Kubernetes: The certificate will be created automatically, but you must wait for it to become available and then validate it in your DNS after the deployment is complete
  • Change the settings in locals.tf (a sketch follows after this list):

    • provider_region = the region you want to deploy in.
    • aws_profile = the profile used to issue the deployments; it targets the deployment account.
    • kms_profile = can be the same profile, unless you want the encryption key elsewhere.
    • kms_key = a pre-created symmetric KMS key. Its only purpose is encryption/decryption of deployment secrets.
    • deployment_name = the name of the deployment, used for the Kubernetes namespace, container naming, and the Datadog "deployment" unified tag.
  • Run terraform init -backend-config=../backend.hcl in the infra directory.

  • Run terraform apply in infra directory. This should complete ok.

    • Check in the console if you see the EKS cluster, RDS database, etc.
    • If you enabled load balancer deployment, check for the load balancer as well.
    • The configuration values needed for application deployment will be output to the console after the apply completes.
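
For reference, a locals.tf for these settings might look roughly like the following (all values are placeholders; the exact set of locals depends on the example you start from):

locals {
  provider_region = "us-east-1"
  aws_profile     = "datafold-deploy"
  kms_profile     = "datafold-deploy"
  kms_key         = "alias/datafold-deployment-secrets"
  deployment_name = "company-datafold"
}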

Application Deployment: After infrastructure is ready, deploy the application using the datafold-operator. Continue with the Datafold Helm Charts repository to deploy the operator manager and then the application through the operator. The operator is the default and recommended method for deploying Datafold.

Infrastructure Dependencies

This module is designed to provide the complete infrastructure stack for Datafold deployment. However, if you already have EKS infrastructure in place, you can choose to configure the required resources independently.

Required Infrastructure Components:

  • EKS cluster with appropriate node groups
  • RDS PostgreSQL database
  • S3 bucket for ClickHouse backups
  • EBS volumes for persistent storage (ClickHouse data, ClickHouse logs, Redis data)
  • IAM roles and service accounts for cluster operations
  • Load balancer (optional, can be managed by AWS Load Balancer Controller)
  • VPC and networking components
  • SSL certificate (validation timing depends on deployment method):
    • Terraform-managed LB: Certificate must be pre-created and validated
    • Kubernetes-managed LB: Certificate created automatically, validated post-deployment

Alternative Approaches:

  • Use this module: Provides complete infrastructure setup for new deployments
  • Use existing infrastructure: Configure required resources manually or through other means
  • Hybrid approach: Use this module for some components and existing infrastructure for others

For detailed specifications of each required component, see the Datafold Dedicated Cloud AWS Deployment Documentation. For application deployment instructions, continue with the Datafold Helm Charts repository to deploy the operator manager and then the application through the operator.

About subnets and where they get created

By default the module deploys into two availability zones, because the private and public subnet variables each default to a list of two CIDR ranges.

Which AZs are used depends on which AZs are selected and in what order; the module sorts them alphabetically. In us-east-1 this can be as many as six AZs.

The module sorts the AZs and then iteratively deploys a public/private subnet pair, assigning each its AZ. Thus:

  • [10.0.0.0/24] will get deployed in us-east-1a
  • [10.0.1.0/24] will get deployed in us-east-1b

To deploy across three AZs, override the public/private subnet settings with three CIDR ranges each. The module will then iterate over three elements; the AZ ordering stays the same by default.
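
For example, overriding both subnet lists with three CIDR ranges each (a sketch; adjust the ranges to your VPC CIDR) spreads the deployment across three AZs:

vpc_private_subnets = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
vpc_public_subnets  = ["10.0.100.0/24", "10.0.101.0/24", "10.0.102.0/24"]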

You can also exclude specific AZ IDs. An AZ ID is not the same as an AZ name: AWS shuffles the mapping between AZ names and physical locations per account, so your us-east-1a might be use1-az1 while the same name maps to use1-az4 in another account. If you need to match AZs across accounts, match availability zone IDs, not availability zone names. AZ IDs are visible on the EC2 "Settings" page, which lists the enabled AZs with their IDs and names.

To restrict the deployment to particular AZ IDs, exclude the ones you do not want via the vpc_exclude_az_ids input. Unfortunately this is an exclude filter rather than an include filter, so if AWS adds new AZs to the region, they may be selected and trigger subnet replacements on a future apply.
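
For example, to keep the deployment out of specific AZ IDs (illustrative values; see the vpc_exclude_az_ids input in the table below):

vpc_exclude_az_ids = ["use1-az1", "use1-az3"]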

The good news is that once AZ letters are in use in an account, their mapping to AZ IDs should remain stable; only new accounts get shuffled again. From a Terraform state perspective, things should therefore stay consistent.

Upgrading to 1.15+

In this version the Terraform providers were upgraded to newer versions, which among other things changes the names of the IAM roles the infra creates. After the upgrade you can therefore expect certain kube-system pods to be stuck in a crash loop.

This happens because the role names created by the infra have changed; they now use a prefix and a suffix.

AWS authenticates the service accounts of certain kube-system pods, such as aws-load-balancer-controller, through those roles, and after this change that role mapping breaks.

There are ways to fix this manually:

  • Apply the application again after applying the infra. This should fix the role names for two pods.
  • Go to the service account of the aws-load-balancer-controller pod.
  • The service account's annotations contain the ARN of the role it needs to assume in the cloud.
  • Update that annotation.

Example:

apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::1234567889:role/datafold-lb-controller
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: aws-load-balancer-controller
  name: aws-load-balancer-controller
  namespace: kube-system

Check Kubernetes for any failing pods in the kube-system namespace; if pods remain in CrashLoopBackOff, their service accounts may need to be updated in the same way.

  • In the newest version of Amazon Linux 3, Datadog cannot determine the local hostname, which it needs for tagging. Updating to the most recent Datadog operator solves this issue:
> helm repo add datadog https://helm.datadoghq.com
> helm repo update datadog
> helm upgrade datafold-datadog-operator datadog/datadog-operator
  • The default Kubernetes version is now 1.33. Nodes will be replaced when you apply this upgrade.
  • The AWS load balancer controller must call the instance metadata service (IMDS). Doing this from a pod requires raising the IMDS hop limit to 2. This avoids hard-coding VPC IDs or regions in the LB controller configuration, at a limited security cost:

https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html#imds-considerations
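
If you manage the node group launch template yourself, the equivalent setting is the IMDS hop limit in the metadata options. A sketch (not the module's internal implementation) of the relevant block:

resource "aws_launch_template" "nodes" {
  # ... instance type, AMI, etc. ...

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"
    http_put_response_hop_limit = 2  # allow pods (one extra network hop) to reach IMDS
  }
}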

Initializing the application

After deploying the application through the operator (see the Datafold Helm Charts repository), open a shell into the <deployment>-dfshell container. The scheduler and server containers are likely crash-looping at this point.

All we need to do is to run these commands:

  1. ./manage.py clickhouse create-tables
  2. ./manage.py database create-or-upgrade
  3. ./manage.py installation set-new-deployment-params

Now all containers should be up and running.

More information

You can get more information from our documentation site:

https://docs.datafold.com/datafold-deployment/dedicated-cloud/aws

Requirements

Name Version
aws >= 6.9.0
dns 3.2.1

Providers

Name Version
aws >= 6.9.0
null n/a
random n/a

Modules

Name Source Version
clickhouse_backup ./modules/clickhouse_backup n/a
database ./modules/database n/a
eks ./modules/eks n/a
github_reverse_proxy ./modules/github_reverse_proxy n/a
load_balancer ./modules/load_balancer n/a
networking ./modules/networking n/a
private_access ./modules/private_access n/a
security ./modules/security n/a
vpc_peering ./modules/vpc_peering n/a

Resources

Name Type

Inputs

Name Description Type Default Required
alb_certificate_domain Pass a domain name like example.com to this variable to enable ALB HTTPS listeners.
Terraform will try to find an issued AWS certificate that matches the requested domain,
so make sure you have already issued a certificate for that domain.
string n/a yes
allowed_principals List of allowed principals allowed to connect to this endpoint. list(string) [] no
apply_major_upgrade Sets the flag to allow AWS to apply major upgrade on the maintenance plan schedule. bool false no
az_index Index of the availability zone number 0 no
backend_app_port The target port to use for the backend services number 80 no
ch_data_ebs_iops IOPS of EBS volume number 3000 no
ch_data_ebs_throughput Throughput of EBS volume number 1000 no
ch_logs_ebs_iops IOPS of EBS volume number 3000 no
ch_logs_ebs_throughput Throughput of EBS volume number 250 no
clickhouse_data_size EBS volume size for clickhouse data in GB number 40 no
clickhouse_logs_size EBS volume size for clickhouse logs in GB number 40 no
clickhouse_s3_bucket Bucket where clickhouse backups are stored string "clickhouse-backups-abcguo23" no
create_rds_kms_key Set to true to create a separate KMS key (Recommended). bool true no
create_ssl_cert Creates an SSL certificate if set. bool n/a yes
database_name RDS database name string "datafold" no
datadog_api_key The API key for Datadog string "" no
db_extra_parameters List of map of extra variables to apply to the RDS database parameter group list [] no
db_instance_tags The extra tags to be applied to the RDS instance. map(any) {} no
db_parameter_group_name The specific parameter group name to associate string "" no
db_parameter_group_tags The extra tags to be applied to the parameter group map(any) {} no
db_subnet_group_name The specific subnet group name to use string "" no
db_subnet_group_tags The extra tags to be applied to the parameter group map(any) {} no
default_node_disk_size Disk size for a node in GB number 40 no
deploy_github_reverse_proxy Determines that the github reverse proxy should be deployed bool false no
deploy_lb Deploys a load balancer when set to true (no longer deployed by default; see Breaking Changes) bool false no
deploy_private_access Determines that the cluster should be 100% private bool false no
deploy_vpc_flow_logs Activates the VPC flow logs if set. bool false no
deploy_vpc_peering Determines that the VPC peering should be deployed bool false no
deployment_name Name of the current deployment. string n/a yes
dhcp_options_domain_name Specifies DNS name for DHCP options set string "" no
dhcp_options_domain_name_servers Specify a list of DNS server addresses for DHCP options set list(string) ["AmazonProvidedDNS"] no
dhcp_options_tags Tags applied to the DHCP options set. map(string) {} no
dns_egress_cidrs List of Internet addresses to which the application has access list(string) [] no
ebs_extra_tags The extra tags to be applied to the EBS volumes map(any) {} no
ebs_type Type for all EBS volumes string "gp3" no
enable_dhcp_options Flag to use custom DHCP options for DNS resolution. bool false no
environment Global environment tag to apply on all datadog logs, metrics, etc. string n/a yes
github_cidrs List of CIDRs that are allowed to connect to the github reverse proxy list(string) [] no
host_override Overrides the default domain name used to send links in invite emails and page links. Useful if the application is behind cloudflare for example. string "" no
ingress_enable_http_sg Whether regular HTTP traffic should be allowed to access the load balancer bool false no
initial_apply_complete Indicates if this infra is deployed or not. Helps to resolve dependencies. bool false no
k8s_access_bedrock Allow cluster to access bedrock in this region bool false no
k8s_api_access_roles Set of roles that are allowed to access the EKS API set(string) [] no
k8s_cluster_version Ref. https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html string "1.33" no
k8s_module_version EKS terraform module version string "~> 19.7" no
k8s_public_access_cidrs List of CIDRs that are allowed to connect to the EKS control plane list(string) n/a yes
lb_access_logs Load balancer access logs configuration. map(string) {} no
lb_deletion_protection Flag if the load balancer can be deleted or not. bool true no
lb_deploy_nlb Flag if the network load balancer should be deployed (usually for incoming private link). bool false no
lb_idle_timeout The time in seconds that the connection is allowed to be idle. number 120 no
lb_internal Set to true to make the load balancer internal and not exposed to the internet. bool false no
lb_name_override An optional override for the name of the load balancer string "" no
lb_nlb_internal Set to true to make the network load balancer internal and not exposed to the internet. bool true no
lb_subnets_override Override subnets to deploy ALB into, otherwise use default logic. list(string) [] no
lb_vpces_details Endpoint service to define for internal traffic over private link
object({
  allowed_principals         = list(string)
  private_dns_name           = optional(string)
  acceptance_required        = bool
  supported_ip_address_types = list(string)
})
null no
managed_node_grp1 Ref. https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/eks-managed-node-group any n/a yes
managed_node_grp2 Ref. https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/eks-managed-node-group any null no
managed_node_grp3 Ref. https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/eks-managed-node-group any null no
monitor_lambda_datadog Whether to monitor the Lambda with Datadog bool false no
nat_gateway_public_ip Public IP of the NAT gateway when reusing the NAT gateway instead of recreating string "" no
peer_region The region of the peer VPC string "" no
peer_vpc_additional_whitelisted_ingress_cidrs List of CIDRs that can pass through the load balancer set(string) [] no
peer_vpc_cidr_block The CIDR block of the peer VPC string "" no
peer_vpc_id The VPC ID to peer with string "" no
peer_vpc_owner_id The AWS account ID of the owner of the peer VPC string "" no
private_subnet_index Index of the private subnet number 0 no
private_subnet_tags The extra tags to be applied to the private subnets map(any) {"Tier": "private"} no
propagate_intra_route_tables_vgw If intra subnets should propagate traffic. bool false no
propagate_private_route_tables_vgw If private subnets should propagate traffic. bool false no
propagate_public_route_tables_vgw If public subnets should propagate traffic. bool false no
provider_azs List of availability zones to consider. If empty, the modules will determine this dynamically. list(string) [] no
provider_region The AWS region in which the infrastructure should be deployed string n/a yes
public_subnet_index Index of the public subnet number 0 no
public_subnet_tags The extra tags to be applied to the public subnets map(any) {"Tier": "public"} no
rds_allocated_storage The size of RDS allocated storage in GB number 20 no
rds_auto_minor_version_upgrade Sets a flag to upgrade automatically all minor versions bool false no
rds_backup_window RDS backup window string "03:00-06:00" no
rds_backups_replication_retention_period RDS backup replication retention period number 14 no
rds_backups_replication_target_region RDS backup replication target region string null no
rds_copy_tags_to_snapshot To copy tags to snapshot or not bool false no
rds_extra_tags The extra tags to be applied to the RDS instance map(any) {} no
rds_identifier Name of the RDS instance string "" no
rds_instance EC2 instance type for the PostgreSQL RDS database.
Available instance groups: t3, m4, m5, r6i, m6i.
Available instance classes: medium and higher.
string "db.t3.medium" no
rds_kms_key_alias RDS KMS key alias. string "datafold-rds" no
rds_maintenance_window RDS maintenance window string "Mon:00:00-Mon:03:00" no
rds_max_allocated_storage The upper limit the database can grow in GB number 100 no
rds_monitoring_interval RDS monitoring interval number 0 no
rds_monitoring_role_arn The IAM role allowed to send RDS metrics to cloudwatch string null no
rds_multi_az RDS instance in multiple AZ's bool false no
rds_param_group_family The DB parameter group family to use string "postgres15" no
rds_password_override Password override string null no
rds_performance_insights_enabled RDS performance insights enabled or not bool false no
rds_performance_insights_retention_period RDS performance insights retention period number 7 no
rds_port The port the RDS database should be listening on. number 5432 no
rds_ro_username RDS read-only user name (not currently used). string "datafold_ro" no
rds_username Overrides the default RDS user name that is provisioned. string "datafold" no
rds_version Postgres RDS version to use. string "15.5" no
redis_data_size Redis EBS volume size in GB number 50 no
redis_ebs_iops IOPS of EBS redis volume number 3000 no
redis_ebs_throughput Throughput of EBS redis volume number 125 no
s3_backup_bucket_name_override Bucket name override. string "" no
s3_clickhouse_backup_tags The extra tags to be applied to the S3 clickhouse backup bucket map(any) {} no
self_managed_node_grp_instance_type Ref. https://github.com/awslabs/amazon-eks-ami/blob/master/files/eni-max-pods.txt string "THe instance type for the self managed node group." no
self_managed_node_grps Ref. https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest/submodules/self-managed-node-group any {} no
service_account_prefix Prefix for service account names to match Helm chart naming (e.g., 'datafold-' for 'datafold-server', or '' for no prefix) string "datafold-" no
sg_tags The extra tags to be applied to the security group map(any) {} no
tags Tags to apply to the general module any {} no
use_default_rds_kms_key Flag whether or not to use the default RDS KMS encryption key. Not recommended. bool false no
vpc_cidr The CIDR of the new VPC, if the vpc_cidr is not set string "10.0.0.0/16" no
vpc_exclude_az_ids AZ IDs to exclude from availability zones list(string) [] no
vpc_id The VPC ID of an existing VPC to deploy the cluster in. Creates a new VPC if not set. string "" no
vpc_private_subnets The private subnet CIDR ranges when a new VPC is created. list(string) ["10.0.0.0/24", "10.0.1.0/24"] no
vpc_propagating_vgws ID's of virtual private gateways to propagate. list(any) [] no
vpc_public_subnets The public network CIDR ranges list(string) ["10.0.100.0/24", "10.0.101.0/24"] no
vpc_tags The extra tags to be applied to the VPC map(any) {} no
vpc_vpn_gateway_id ID of the VPN gateway to attach to the VPC string "" no
vpce_details Endpoint names to define with security group rule definitions
map(object({
  vpces_service_name  = string
  subnet_ids          = optional(list(string), [])
  private_dns_enabled = optional(bool, true)
  input_rules = list(object({
    description = string
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = string
  }))
  output_rules = list(object({
    description = string
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = string
  }))
}))
{} no
vpn_cidr CIDR range for administrative access string "" no
whitelisted_egress_cidrs List of Internet addresses the application can access going outside list(string) n/a yes
whitelisted_ingress_cidrs List of CIDRs that can pass through the load balancer list(string) n/a yes

Outputs

Name Description
clickhouse_backup_role_name The name of the role for clickhouse backups
clickhouse_data_size The size in GB of the clickhouse EBS data volume
clickhouse_data_volume_id The EBS volume ID where clickhouse data will be stored.
clickhouse_logs_size The size in GB of the clickhouse EBS logs volume
clickhouse_logs_volume_id The EBS volume ID where clickhouse logs will be stored.
clickhouse_password The generated clickhouse password to be used in the application deployment
clickhouse_s3_bucket The location of the S3 bucket where clickhouse backups are stored
clickhouse_s3_region The region where the S3 bucket is created
cloud_provider A string describing the type of cloud provider to be passed onto the helm charts
cluster_endpoint The URL to the EKS cluster endpoint
cluster_name The name of the EKS cluster
cluster_scaler_role_arn The ARN of the role that is able to scale the EKS cluster nodes.
db_instance_id The ID of the RDS database instance
deployment_name The name of the deployment
dfshell_role_arn The ARN of the AWS Bedrock role
dfshell_service_account_name The name of the service account for dfshell
dma_role_arn The ARN of the AWS Bedrock role
dma_service_account_name The name of the service account for dma
domain_name The domain name to be used in DNS configuration
github_reverse_proxy_url The URL of the API Gateway that acts as a reverse proxy to the GitHub API
k8s_load_balancer_controller_role_arn The ARN of the role provisioned so the k8s cluster can edit the target group through the AWS load balancer controller.
lb_name The name of the external load balancer
load_balancer_ips The load balancer IP when it was provisioned.
operator_role_arn The ARN of the AWS Bedrock role
operator_service_account_name The name of the service account for operator
postgres_database_name The name of the pre-provisioned database.
postgres_host The DNS name for the postgres database
postgres_password The generated postgres password to be used by the application
postgres_port The port configured for the RDS database
postgres_username The postgres username to be used by the application
private_access_vpces_name Name of the VPCE service that allows private access to the cluster endpoint
redis_data_size The size in GB of the Redis data volume.
redis_data_volume_id The EBS volume ID of the Redis data volume.
redis_password The generated redis password to be used in the application deployment
scheduler_role_arn The ARN of the AWS Bedrock role
scheduler_service_account_name The name of the service account for scheduler
security_group_id The security group ID managing ingress from the load balancer
server_role_arn The ARN of the AWS Bedrock role
server_service_account_name The name of the service account for server
storage_worker_role_arn The ARN of the AWS Bedrock role
storage_worker_service_account_name The name of the service account for storage_worker
target_group_arn The ARN to the target group where the pods need to be registered as targets.
vpc_cidr The CIDR of the entire VPC
vpc_id The ID of the VPC
vpces_azs Set of availability zones where the VPCES is available.
worker_catalog_role_arn The ARN of the AWS Bedrock role
worker_catalog_service_account_name The name of the service account for worker_catalog
worker_interactive_role_arn The ARN of the AWS Bedrock role
worker_interactive_service_account_name The name of the service account for worker_interactive
worker_lineage_role_arn The ARN of the AWS Bedrock role
worker_lineage_service_account_name The name of the service account for worker_lineage
worker_monitor_role_arn The ARN of the AWS Bedrock role
worker_monitor_service_account_name The name of the service account for worker_monitor
worker_portal_role_arn The ARN of the AWS Bedrock role
worker_portal_service_account_name The name of the service account for worker_portal
worker_role_arn The ARN of the AWS Bedrock role
worker_service_account_name The name of the service account for worker
worker_singletons_role_arn The ARN of the AWS Bedrock role
worker_singletons_service_account_name The name of the service account for worker_singletons
