
Conversation


@marun marun commented Apr 28, 2021

OpenShift since 3.x has injected the service serving certificate CA (service CA) bundle into service account token secrets. This was intended to ensure that all pods would be able to easily verify connections to endpoints secured with service serving certificates. Since breaking customer workloads is not an option, and there is no way to ensure that customers are not relying on the service CA bundle being mounted at /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt, it is necessary to continue mounting the service CA bundle in the same location in the bound token projected volumes enabled by the BoundServiceAccountTokenVolume feature (enabled by default in 1.21).

A new controller is added to create a configmap per namespace that is annotated for service CA injection. The controller is derived from the controller that creates configmaps for the root CA. The service account admission controller is updated to include a source for the new configmap in the default projected volume definition.
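For illustration, here is a minimal Go sketch of what the default projected volume could look like with the extra source added. It assumes the usual kube-api-access projection shape (bound token plus kube-root-ca.crt) and reuses the openshift-service-ca.crt configmap name and service-ca.crt key that appear later in this thread; it is a sketch, not the actual admission plugin code, and the downwardAPI namespace source is omitted for brevity.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildKubeAPIAccessVolume sketches the projected volume that service account
// admission would generate once a source for the per-namespace service CA
// configmap is included alongside the token and the root CA bundle.
func buildKubeAPIAccessVolume(name string) corev1.Volume {
	expiration := int64(3607) // expiration used for bound tokens; assumption for this sketch
	return corev1.Volume{
		Name: name,
		VolumeSource: corev1.VolumeSource{
			Projected: &corev1.ProjectedVolumeSource{
				Sources: []corev1.VolumeProjection{
					{
						ServiceAccountToken: &corev1.ServiceAccountTokenProjection{
							ExpirationSeconds: &expiration,
							Path:              "token",
						},
					},
					{
						ConfigMap: &corev1.ConfigMapProjection{
							LocalObjectReference: corev1.LocalObjectReference{Name: "kube-root-ca.crt"},
							Items:                []corev1.KeyToPath{{Key: "ca.crt", Path: "ca.crt"}},
						},
					},
					{
						// New source: the per-namespace service CA configmap created by the new controller.
						ConfigMap: &corev1.ConfigMapProjection{
							LocalObjectReference: corev1.LocalObjectReference{Name: "openshift-service-ca.crt"},
							Items:                []corev1.KeyToPath{{Key: "service-ca.crt", Path: "service-ca.crt"}},
						},
					},
				},
			},
		},
	}
}

func main() {
	v := buildKubeAPIAccessVolume("kube-api-access-example")
	fmt.Printf("projected sources: %d\n", len(v.VolumeSource.Projected.Sources))
}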

@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Apr 28, 2021
@openshift-ci-robot

@marun: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Apr 28, 2021
@openshift-ci-robot

@marun: This pull request references Bugzilla bug 1946479, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 1946479: Re-enable BoundServiceAccountTokenVolume disabled by 1.21 rebase

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Apr 28, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 28, 2021
@marun

marun commented Apr 28, 2021

/retest

@marun marun force-pushed the fix-bound-service-account-token-volume branch from db490ee to 3b1bac0 on April 28, 2021 14:25
@openshift-ci-robot

@marun: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

@openshift-ci-robot

@marun: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Compare:

// if we have a rootCA bundle add that too. The rootCA will be used when hitting the default master service, since those are signed
// using a different CA by default. The rootCA's key is more closely guarded than ours and if it is compromised, that power could
// be used to change the trusted signers for every pod anyway, so we're already effectively trusting it.
if len(controllerOptions.RootCA) > 0 {
	controllerOptions.ServiceServingCA = append(controllerOptions.ServiceServingCA, controllerOptions.RootCA...)
	controllerOptions.ServiceServingCA = append(controllerOptions.ServiceServingCA, []byte("\n")...)
}
controllerOptions.ServiceServingCA = append(controllerOptions.ServiceServingCA, serviceServingCA...)
and the surrounding code. The main differences are the inclusion of the “rootCA”, and that the data is read from files and not from the API.

(I have no idea whether the old behavior is correct, I just want the difference to be an intentional decision.)

Comment on lines 360 to 361

I’m sure this is a stupid question and I hope to learn something: why wouldn’t a natural way to do this be to change the generated projected volume in

Sources: []api.VolumeProjection{
? At least the service-ca.crt file conflict would go away entirely.


Answering myself:

projected volumes only support local object references
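To make that concrete: in the core v1 API, the configmap projection embeds a LocalObjectReference, so a projected source can only name an object in the pod's own namespace. A tiny sketch (the configmap name is taken from later in this thread):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// corev1.ConfigMapProjection embeds corev1.LocalObjectReference, which carries
	// only a Name -- there is no Namespace field. A projected volume can therefore
	// only reference configmaps in the pod's own namespace, which is why a
	// per-namespace service CA configmap is needed rather than a single
	// cluster-wide reference.
	src := corev1.VolumeProjection{
		ConfigMap: &corev1.ConfigMapProjection{
			LocalObjectReference: corev1.LocalObjectReference{Name: "openshift-service-ca.crt"},
		},
	}
	fmt.Println(src.ConfigMap.Name) // just a name, resolved in the pod's namespace
}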


Following this train of thought, the old token controller creating secrets that include service-ca.crt is still running with BoundServiceAccountTokenVolume enabled. So #724 reproduces the old contents of that file (including the root CAs), and is pretty close to the trade-offs of the old approach, and at least survives bootstrap. OTOH “pretty close to the old approach” is probably a good reason not to do it that way.

@marun marun force-pushed the fix-bound-service-account-token-volume branch from d3cb32d to ce2e4a4 on May 10, 2021 22:14
@openshift-ci-robot

@marun: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

@openshift-ci

openshift-ci bot commented May 10, 2021

@marun: This pull request references Bugzilla bug 1946479, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 1946479: Re-enable BoundServiceAccountTokenVolume disabled by 1.21 rebase

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marun marun force-pushed the fix-bound-service-account-token-volume branch from ce2e4a4 to 1cadcb0 on May 10, 2021 22:26
@openshift-ci-robot

@marun: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

@openshift-ci

openshift-ci bot commented May 10, 2021

@marun: This pull request references Bugzilla bug 1946479, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 1946479: Re-enable BoundServiceAccountTokenVolume disabled by 1.21 rebase

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@marun marun force-pushed the fix-bound-service-account-token-volume branch from 1cadcb0 to f2aa03c on May 10, 2021 23:42
@openshift-ci-robot

@marun: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

@marun marun force-pushed the fix-bound-service-account-token-volume branch from f2aa03c to f0f6d79 on May 10, 2021 23:53
@openshift-ci-robot

@marun: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

@marun marun force-pushed the fix-bound-service-account-token-volume branch from f0f6d79 to 921a590 on May 13, 2021 02:53
@openshift-ci-robot

@marun: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

@marun

marun commented May 13, 2021

/retest

1 similar comment
@marun

marun commented May 14, 2021

/retest

@s-urbaniak

s-urbaniak commented Jun 8, 2021

I have something that is (hopefully) not a red herring: kube-controller-manager-operator fails to upgrade on a local cluster because:

$ kubectl -n openshift-kube-controller-manager-operator describe pod kube-controller-manager-operator-56697fcbb7-xkx5q 
Name:                 kube-controller-manager-operator-56697fcbb7-xkx5q
Namespace:            openshift-kube-controller-manager-operator
...
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    14m                  default-scheduler  Successfully assigned openshift-kube-controller-manager-operator/kube-controller-manager-operator-56697fcbb7-xkx5q to ip-10-0-169-76.eu-west-3.compute.internal
  Warning  FailedMount  9m59s                kubelet            Unable to attach or mount volumes: unmounted volumes=[kube-api-access-jkt6h], unattached volumes=[kube-api-access-jkt6h config serving-cert]: timed out waiting for the condition
  Warning  FailedMount  3m15s (x4 over 12m)  kubelet            Unable to attach or mount volumes: unmounted volumes=[kube-api-access-jkt6h], unattached volumes=[config serving-cert kube-api-access-jkt6h]: timed out waiting for the condition
  Warning  FailedMount  117s (x14 over 14m)  kubelet            MountVolume.SetUp failed for volume "kube-api-access-jkt6h" : configmap "openshift-service-ca.crt" not found

It looks like a race between the operator and the operand. The former is on a newer version but fails to mount the new service CA while the latter is not updated yet and does not provision the CA in the first place.

I think we simply have to create an empty initial service CA configmap.
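A minimal client-go sketch of that idea, assuming the configmap is named openshift-service-ca.crt (as in the mount error above) and that the service.beta.openshift.io/inject-cabundle annotation is what marks it for injection; the annotation name and the target namespace are assumptions, not confirmed in this thread:

package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// ensureServiceCAConfigMap creates an empty, injection-annotated configmap so that
// pods admitted with the new projected source can mount it even before the new
// controller (or the service CA operator) has populated its contents.
func ensureServiceCAConfigMap(ctx context.Context, client kubernetes.Interface, namespace string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "openshift-service-ca.crt",
			Namespace: namespace,
			Annotations: map[string]string{
				// Annotation assumed from the PR description ("annotated for service ca injection").
				"service.beta.openshift.io/inject-cabundle": "true",
			},
		},
	}
	_, err := client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil
	}
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := ensureServiceCAConfigMap(context.Background(), client, "openshift-kube-controller-manager-operator"); err != nil {
		log.Fatal(err)
	}
}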

@stlaz

stlaz commented Jun 8, 2021

/lgtm cancel
As per investigation with @s-urbaniak, this will brick cluster upgrades because KCMs only update AFTER KAS, but KAS already starts injecting the configmap at that point.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2021
@s-urbaniak

The chain of events in an upgrade situation is as follows, assuming events happen in this order:

  1. kube-apiserver is upgraded to a newer version, enabling the projection of openshift-service-ca.crt for all service accounts
  2. kube-controller-manager-operator is upgraded to a newer version, and the new apiserver admission adds the projected mount to its pod spec
  3. kube-controller-manager-operator never succeeds in starting, because the new projected volume for openshift-service-ca.crt is not provisioned by the still-running old kube-controller-manager
  4. The new kube-controller-manager (operand) never starts because its operator never starts

@s-urbaniak

s-urbaniak commented Jun 8, 2021

We have another failure mode, which explains e2e upgrade run failures like this one: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/714/pull-ci-openshift-kubernetes-master-e2e-gcp-upgrade/1402105642675605504

: operator conditions kube-apiserver | 0s
Operator degraded (InstallerPodContainerWaiting_ContainerCreating): InstallerPodContainerWaitingDegraded: Pod "installer-9-ci-op-vtmfiht6-3bd46-28x7p-master-1" on node "ci-op-vtmfiht6-3bd46-28x7p-master-1" container "installer" is waiting since 2021-06-08 04:48:08 +0000 UTC because ContainerCreating

I checked and verified that installer-9-ci-op-vtmfiht6-3bd46-28x7p-master-1 is also provisioned with the new projected service account mounts.

Hence, the upgrade failure can already happen during the upgrade of the apiserver itself, as soon as at least one (new) apiserver instance is up.

@s-urbaniak

As discussed OOB, the resolution is to manually craft serviceaccount mounts for certain control plane pods, bypassing serviceaccount admission (see the sketch after this list):

  • installer pods
  • kube-apiserver pods
  • kube-controller-manager-operator
  • kube-controller-manager
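A rough sketch of what such a hand-crafted mount could look like for one of these pods, assuming the pod opts out of automounting and defines the projected volume itself; both the opt-out and all names here are illustrative assumptions, not the actual manifests, which live in the respective operator repositories:

package main

import (
	corev1 "k8s.io/api/core/v1"
)

// handCraftAPIAccess sketches a manually defined token/CA mount: the pod opts out
// of automounting (assumption for this sketch) and carries its own projected
// volume at the conventional path, so it does not depend on admission injecting a
// configmap source that may not exist yet during the upgrade.
func handCraftAPIAccess(spec *corev1.PodSpec) {
	automount := false
	spec.AutomountServiceAccountToken = &automount
	expiration := int64(3607)
	spec.Volumes = append(spec.Volumes, corev1.Volume{
		Name: "kube-api-access",
		VolumeSource: corev1.VolumeSource{
			Projected: &corev1.ProjectedVolumeSource{
				Sources: []corev1.VolumeProjection{
					{ServiceAccountToken: &corev1.ServiceAccountTokenProjection{ExpirationSeconds: &expiration, Path: "token"}},
					{ConfigMap: &corev1.ConfigMapProjection{
						LocalObjectReference: corev1.LocalObjectReference{Name: "kube-root-ca.crt"},
						Items:                []corev1.KeyToPath{{Key: "ca.crt", Path: "ca.crt"}},
					}},
				},
			},
		},
	})
	for i := range spec.Containers {
		spec.Containers[i].VolumeMounts = append(spec.Containers[i].VolumeMounts, corev1.VolumeMount{
			Name:      "kube-api-access",
			MountPath: "/var/run/secrets/kubernetes.io/serviceaccount",
			ReadOnly:  true,
		})
	}
}

func main() {
	spec := &corev1.PodSpec{Containers: []corev1.Container{{Name: "operator"}}}
	handCraftAPIAccess(spec)
}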

@stlaz

stlaz commented Jun 9, 2021

/retest
This might pass now in theory.

@s-urbaniak

/test e2e-gcp-upgrade

@s-urbaniak

/test e2e-gcp-upgrade

@stlaz

stlaz commented Jun 10, 2021

/lgtm
/retest
Looks like the issues have been solved.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2021
@stlaz

stlaz commented Jun 10, 2021

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 10, 2021
@openshift-ci

openshift-ci bot commented Jun 10, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, marun, s-urbaniak, soltysh, stlaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mfojtik

mfojtik commented Jun 10, 2021

/override ci/prow/e2e-gcp-upgrade

@mfojtik mfojtik merged commit a5ec692 into openshift:master Jun 10, 2021
@openshift-ci

openshift-ci bot commented Jun 10, 2021

@mfojtik: Overrode contexts on behalf of mfojtik: ci/prow/e2e-gcp-upgrade

In response to this:

/override ci/prow/e2e-gcp-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci

openshift-ci bot commented Jun 10, 2021

@marun: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 1946479 has not been moved to the MODIFIED state.

In response to this:

Bug 1946479: Re-enable BoundServiceAccountTokenVolume disabled by 1.21 rebase

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
