Skip to content

Conversation

@cnmcavoy
Copy link
Contributor

@cnmcavoy cnmcavoy commented Sep 27, 2022

What type of PR is this?
/kind feature

What this PR does / why we need it:
Enhances the EKSConfigTemplate cloud-init to expose many more userdata features when bootstrapping EKS nodes in a EKS managed cluster. CAPI already exposes these configuration in the KubeadmConfigTemplate. Feature parity makes it easier for users coming from a self-managed clusters to switch and adopt EKS clusters.

Enhances the EKSConfigTemplate cloud-init, which currently uses a bash-style cloud-init user data to the YAML-based cloud-init data format, which has exposes more capabilities than can easily be accomplished with bash. CAPI's KubeadmConfigTemplate uses the YAML data format of cloud-init, which allows it to have additional capabilities in disk, mount, and filesystem layouts, amongst other things. This PR adds the same capabilities to CAPA and exposes them as configuration in the EKSConfigTemplate.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3756

Special notes for your reviewer:
This PR was rebased off of an existing PR, #3341, which exposed more commands to the user, but did not switch the cloud-init away from a bash format.

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority labels Sep 27, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @cnmcavoy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 27, 2022
@richardcase
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 28, 2022
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/cloud-init branch 2 times, most recently from 42f588f to 9182683 Compare September 28, 2022 18:44
@cnmcavoy cnmcavoy changed the title WIP: Enhance the EKSConfigTemplate and add configuration for files, mounts, users, ntp, etc for CAPI feature parity Enhance the EKSConfigTemplate and add configuration for files, mounts, users, ntp, etc for CAPI feature parity Sep 28, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 28, 2022
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/cloud-init branch 5 times, most recently from cc8c765 to ac91529 Compare October 4, 2022 21:01
@Skarlso
Copy link
Contributor

Skarlso commented Oct 7, 2022

/assign

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the line above, but what's missing here is to convert these values:

// WARNING: in.PreBootstrapCommands requires manual conversion: does not exist in peer-type
	// WARNING: in.PostBootstrapCommands requires manual conversion: does not exist in peer-type
	// WARNING: in.BootstrapCommandOverride requires manual conversion: does not exist in peer-type
	// WARNING: in.Files requires manual conversion: does not exist in peer-type
	// WARNING: in.DiskSetup requires manual conversion: does not exist in peer-type
	// WARNING: in.Mounts requires manual conversion: does not exist in peer-type
	// WARNING: in.Users requires manual conversion: does not exist in peer-type
	// WARNING: in.NTP requires manual conversion: does not exist in peer-type

These should be done manually in the above ConvertTo and ConvertFrom methods.

For example, like here: https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3741/files#diff-a86bb4e63de0ea453a7d13e98a2d9b8b7de1181d7bfa542a67e76e7d711678baR36

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing an example, I was able to fill in the ConvertTo and ConvertFrom and restore the conversion_test that I deleted to make the tests pass earlier.

Are there any additional docs on this conversion behavior? I still feel like there's a bit happening under the covers in the conversion that I don't follow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, they are converted to annotations and then those annotations are restored to actual values. :)

Yeah, the conversion Fuzzy test makes sure that the conversion ( when people upgrade for example ) work in these scenarios.

Copy link
Contributor

@Skarlso Skarlso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple little changes. This looks okay for me so far. Can you please add an e2e test for this? I can help you with how to write it. :)

Alternatively, you can create a follow up ticket to write an e2e test, in that case, please include a manual run and it's output showing that it actually works.

@cnmcavoy cnmcavoy force-pushed the cnmcavoy/cloud-init branch from ac91529 to 5bbad9f Compare October 7, 2022 20:08
@cnmcavoy
Copy link
Contributor Author

cnmcavoy commented Oct 7, 2022

Just a couple little changes. This looks okay for me so far. Can you please add an e2e test for this? I can help you with how to write it. :)

I'll gladly write an e2e test for this, but I'm not familiar with how the e2e testing works here in general so I'd need some hand holding in testing locally and writing them. To not block this, I lean towards adding the e2e test in a separate ticket.

Alternatively, you can create a follow up ticket to write an e2e test, in that case, please include a manual run and it's output showing that it actually works.

What kind of output are you looking for? The controller doesn't log anything new, but here's an example EKSConfigTemplate which I successfully used with a MachineDeployment to create EKS workers:

apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: EKSConfigTemplate
metadata:
  annotations:
  creationTimestamp: "2022-09-28T19:09:35Z"
  generation: 1
  name: rad-worker-az1-bd8c2fb9
  namespace: capi-ekscmhdev
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: ekscmhdev
    uid: 442b7e5e-85b6-43fb-a2b8-6b6f27dd4f03
  resourceVersion: "172425"
  uid: 9daddfce-2505-4972-9bdf-7af2c745184a
spec:
  template:
    spec:
      containerRuntime: containerd
      diskSetup:
        filesystems:
        - device: /dev/sdb
          filesystem: ext4
          label: RAD
        partitions:
        - device: /dev/sdb
          layout: true
      files:
      - content: |
          vm.max_map_count=262144
        path: /etc/sysctl.d/90-vm-max-map-count.conf
      - content: |
          fs.inotify.max_user_instances=256
        path: /etc/sysctl.d/91-fs-inotify.conf
      kubeletExtraArgs:
        cloud-provider: aws
        event-qps: "10"
        eviction-hard: memory.available<500Mi,nodefs.available<10%
        kube-reserved: cpu=500m,memory=2Gi,ephemeral-storage=1Gi
        node-labels: role.node.kubernetes.io/worker=true,role.node.kubernetes.io/rad-worker=true,network.node.indeed.com/pods=flat,node.indeed.com/node-group=rad-worker-az1
        protect-kernel-defaults: "true"
        register-with-taints: role.node.kubernetes.io/rad-worker=true:NoSchedule
        system-reserved: cpu=500m,memory=1Gi,ephemeral-storage=1Gi
      mounts:
      - - LABEL=RAD
        - /var/lucy
        - ext4
        - defaults
      preBootstrapCommands:
      - 'TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds:
        21600" 2>/tmp/token)'
      - 'INSTANCE_ID=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/instance-id
        2>/tmp/instance)'
      - 'REGION=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/placement/region
        2>/tmp/region)'
      - aws ec2 modify-instance-metadata-options --http-put-response-hop-limit 2 --region
        ${REGION} --instance-id ${INSTANCE_ID}
      - chown 1505:1505 /var/lucy
      - chmod 2755 /var/lucy
      - mkdir -p /var/lucy/_rad/headwater2
      - mkdir -p /var/lucy/_rad/delta2
      - mkdir -p /var/lucy/_rad/sandboxes
      - chown -R 1505:1505 /var/lucy/_rad
      - chmod 2775 /var/lucy/_rad/sandboxes
      - sudo systemctl restart systemd-sysctl

@Skarlso
Copy link
Contributor

Skarlso commented Oct 8, 2022

For output, I'm looking for something like the actual set of values applied to the cluster and the node. So I would assume a describe of the Node would show the set values.

@cnmcavoy
Copy link
Contributor Author

cnmcavoy commented Oct 10, 2022

For output, I'm looking for something like the actual set of values applied to the cluster and the node. So I would assume a describe of the Node would show the set values.

I dumped capi/capa resources and nodes: https://gist.github.com/cnmcavoy/52457c07ccdf07b388bf62a731c72e27

We use a helm chart internally to generate the Cluster, MachineDeployments, and other resources. I can include the output of this chart too if desired.

@cnmcavoy cnmcavoy requested review from Skarlso and removed request for sedefsavas October 17, 2022 16:08
@Skarlso
Copy link
Contributor

Skarlso commented Oct 18, 2022

@cnmcavoy So looking at this output, I have no idea what values you set. :)) Can you also please share your config file?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 20, 2022
Signed-off-by: Richard Case <[email protected]>
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/cloud-init branch from 5bbad9f to c869d47 Compare October 24, 2022 17:42
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2022
@cnmcavoy
Copy link
Contributor Author

@cnmcavoy So looking at this output, I have no idea what values you set. :)) Can you also please share your config file?

Not sure what config file you are referencing. We use a custom helm chart to generate CAPI resources for each cluster, since we started using CAPI in the 0.3 days. I created a gist that also shows the output of the chart, which are effectively the inputs to CAPI: https://gist.github.com/cnmcavoy/b93420d14fc38cfe6bb080fcaac92032

@Skarlso
Copy link
Contributor

Skarlso commented Oct 25, 2022

Okay so I might be misunderstanding something. But I thought that with this pr you are enhancing the ability to configure bootstrapping nodes through making the bootstrap script configurable. Adding pre and post bootstrap scripts as well. What I'm looking for is using this ability to set something to see that it works.

@Skarlso
Copy link
Contributor

Skarlso commented Oct 26, 2022

/lgtm
With one remark.

@richardcase If you still want to take do a read on this. :)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2022
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/cloud-init branch from c869d47 to 9b8b5af Compare October 26, 2022 16:14
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2022
@Skarlso
Copy link
Contributor

Skarlso commented Oct 27, 2022

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Skarlso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2022
@Skarlso
Copy link
Contributor

Skarlso commented Oct 27, 2022

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 27, 2022
@Skarlso
Copy link
Contributor

Skarlso commented Oct 27, 2022

/test ?

@k8s-ci-robot
Copy link
Contributor

@Skarlso: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Skarlso
Copy link
Contributor

Skarlso commented Oct 27, 2022

/test pull-cluster-api-provider-aws-e2e

@Skarlso
Copy link
Contributor

Skarlso commented Oct 27, 2022

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 27, 2022
@k8s-ci-robot k8s-ci-robot merged commit ad1231a into kubernetes-sigs:main Oct 27, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.x milestone Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance cloud-init to use the data-format and add missing capabilities for parity w/CAPI

4 participants