pkoutsovasilis
Contributor

@pkoutsovasilis pkoutsovasilis commented Jun 18, 2025

What does this PR do?

This PR introduces the following changes:

  1. Vendor external Kubernetes artifacts for testing:

    • The kube-state-metrics (KSM) Helm subchart is now vendored to eliminate remote fetches during k8s integration tests.
    • Rendered manifests from the kustomize configuration are vendored to remove the runtime dependency on kustomize.
    • The kube-stack Helm chart is also vendored to avoid network pulls.

    To support this, a new mage target called integration:buildKubernetesTestData was added. This target is now invoked as a prerequisite by:

    • integration:testKubernetes
    • integration:testKubernetesMatrix
    • integration:testKubernetesSingle

    This aims to prevent CI failures from GitHub/network issues and addresses #8319.

  2. Fix decoding of Beats-style API Keys in K8s tests:

    • PR #7754 introduced a regression that broke Beats-style API key generation required by the Helm and Kustomize k8s integration tests.
    • This PR patches the decoding logic, and highlights the need to validate event ingestion in all integration tests.
  3. Fix %CA_TRUSTED% environment variable injection in Kustomize tests:

    • Recent failures revealed malformed ca_trusted_fingerprint values caused by %CA_TRUSTED% placeholders not being overridden.
    • This PR sets CA_TRUSTED to an empty value in the relevant test environments to ensure expected TLS behavior.
    {"log.level":"info","@timestamp":"2025-06-18T02:40:53.068Z","message":"'ca_trusted_fingerprint' set, looking for matching fingerprints","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-default","type":"filestream"},"log":{"source":"filestream-default"},"log.logger":"tls","log.origin":{"file.line":180,"file.name":"tlscommon/tls_config.go","function":"github.com/elastic/elastic-agent-libs/transport/tlscommon.trustRootCA"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
    {"log.level":"error","@timestamp":"2025-06-18T02:40:53.068Z","message":"Error dialing decode 'ca_trusted_fingerprint': encoding/hex: invalid byte: U+0025 '%'","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-default","type":"filestream"},"log":{"source":"filestream-default"},"ecs.version":"1.6.0","log.logger":"esclientleg","log.origin":{"file.line":39,"file.name":"transport/logging.go","function":"github.com/elastic/elastic-agent-libs/transport/httpcommon.(*HTTPTransportSettings).RoundTripper.LoggingDialer.func2"},"service.name":"filebeat","network.transport":"tcp","server.address":"767b2b9229dd4bd098dab11b95e74c64.us-west2.gcp.elastic-cloud.com:443","ecs.version":"1.6.0"}
    
  4. Increase memory limits for Elastic Agent in Kustomize-based tests:

    • Multiple OOMKilled errors were observed (example 1, example 2).
    • A bump in memory limits is applied specifically for the Kustomize scenario, where all inputs run on a single DaemonSet pod, potentially spawning more Beat subprocesses.
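
For item 2, the two API key forms differ only in encoding. A minimal sketch, assuming the key arrives in Elasticsearch's base64 `encoded` form (base64 of `id:api_key`) and that Beats expects the plain `id:api_key` form; `beatsStyleAPIKey` is a hypothetical name and this is not the code changed in this PR:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// beatsStyleAPIKey (hypothetical helper) converts an "encoded" API key,
// assumed to be base64("id:api_key"), into the plain "id:api_key" form.
func beatsStyleAPIKey(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", fmt.Errorf("decoding API key: %w", err)
	}
	// Validate the shape so a malformed key fails here, not at ingestion time.
	id, key, ok := strings.Cut(string(raw), ":")
	if !ok || id == "" || key == "" {
		return "", fmt.Errorf("decoded API key is not in id:api_key form")
	}
	return id + ":" + key, nil
}

func main() {
	encoded := base64.StdEncoding.EncodeToString([]byte("my-id:my-secret"))
	key, err := beatsStyleAPIKey(encoded)
	fmt.Println(key, err)
}
```

Failing early on a malformed key is one way to surface this class of regression before a test silently ingests no events.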

Why is it important?

  • Ensures Kubernetes integration tests are resilient to GitHub outages and network instability by removing external dependencies.
  • Fixes broken functionality introduced in previous PRs related to API key handling.
  • Prevents TLS misconfiguration caused by unintended environment variable injections.
  • Addresses out-of-memory crashes that reduce test reliability and increase CI flakiness.
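
The TLS failure described in item 3 can be reproduced in isolation. A minimal sketch, assuming the fingerprint is hex-decoded as the logged error suggests and that an empty value means "not configured"; `parseFingerprint` is a hypothetical helper, not the elastic-agent-libs implementation:

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// parseFingerprint (hypothetical) treats an empty value as "not configured"
// and otherwise requires a valid hex string.
func parseFingerprint(value string) ([]byte, error) {
	if value == "" {
		return nil, nil
	}
	fp, err := hex.DecodeString(value)
	if err != nil {
		return nil, fmt.Errorf("decode 'ca_trusted_fingerprint': %w", err)
	}
	return fp, nil
}

func main() {
	// An unreplaced placeholder is not valid hex, so decoding fails.
	_, err := parseFingerprint("%CA_TRUSTED%")
	fmt.Println(err)

	// With CA_TRUSTED set to an empty value there is nothing to decode.
	fp, err := parseFingerprint("")
	fmt.Println(fp, err)
}
```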

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None expected. All changes are isolated to the Kubernetes integration test setup and do not affect runtime or user configurations.

How to test this PR locally

mage integration:buildKubernetesTestData

EXTERNAL=true SNAPSHOT=true PACKAGES=docker DOCKER_VARIANTS=basic PLATFORMS=linux/arm64 mage package

INSTANCE_PROVISIONER="kind" TEST_PLATFORMS="kubernetes/arm64/1.33.0/basic" mage integration:TestKubernetes
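
The first command runs the data-building target on its own; the test targets also invoke it as a prerequisite. A standard-library sketch of that run-once prerequisite pattern, assuming hypothetical function names (the real targets live in the project's magefile and are not reproduced here):

```go
package main

import (
	"fmt"
	"sync"
)

var (
	buildOnce  sync.Once
	buildErr   error
	buildCount int
)

// buildKubernetesTestData (hypothetical stand-in) prepares the vendored test
// data. sync.Once guarantees the work runs a single time per process.
func buildKubernetesTestData() error {
	buildOnce.Do(func() {
		buildCount++
		// Real work (rendering manifests, copying charts) would go here.
		buildErr = nil
	})
	return buildErr
}

// testKubernetes (hypothetical stand-in) runs its prerequisite first.
func testKubernetes() error {
	if err := buildKubernetesTestData(); err != nil {
		return fmt.Errorf("building test data: %w", err)
	}
	// The integration tests would run here.
	return nil
}

func main() {
	_ = testKubernetes()
	_ = testKubernetes()
	fmt.Println("test data builds:", buildCount)
}
```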

Related issues

@pkoutsovasilis pkoutsovasilis self-assigned this Jun 18, 2025
@pkoutsovasilis pkoutsovasilis added the Team:Elastic-Agent-Control-Plane, skip-changelog, and backport-active-all labels Jun 18, 2025
@pkoutsovasilis pkoutsovasilis changed the title from "[ci] fix k8s integration test flakiness" to "[ci] fix k8s integration tests flakiness" Jun 18, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/k8s_integration_test_flakiness branch from 9e27cf0 to d00a1b8 Compare June 18, 2025 08:03
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review June 18, 2025 20:18
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner June 18, 2025 20:18
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis requested review from pchila and removed request for michalpristas June 18, 2025 20:18
@pkoutsovasilis pkoutsovasilis force-pushed the fix/k8s_integration_test_flakiness branch from d00a1b8 to b2d39c8 Compare June 18, 2025 20:41
pchila
pchila previously approved these changes Jun 19, 2025
Member

@pchila pchila left a comment

One question and a small nitpick on some test path (both are non-blocking)
LGTM otherwise

@elasticmachine
Collaborator

💛 Build succeeded, but was flaky

cc @pkoutsovasilis

pchila
pchila previously approved these changes Jun 19, 2025
@pchila
Member

pchila commented Jun 19, 2025

After some testing with @pkoutsovasilis, it seems that we can vendor our Helm dependencies in uncompressed form: instead of including deploy/helm/elastic-agent/charts/kube-state-metrics-5.30.1.tgz, we can include the uncompressed deploy/helm/elastic-agent/charts/kube-state-metrics directory (the same applies to the testing/integration/k8s/testdata/opentelemetry-kube-stack-0.3.9.tgz Helm chart in testdata).

This has the benefit of not having to include a binary file in our git changes, and we can see more clearly what changes when a chart version gets bumped.

@pkoutsovasilis could you please add a commit with such a change?

pchila
pchila previously approved these changes Jun 20, 2025
Member

@pchila pchila left a comment

Even if it increases the number of new files, I prefer the exploded charts to having a .tgz committed in git.
It's a shame that the lint GH action does not support PR diffs over 20k lines, so it cannot limit the linter violations to the modified files, but this should not be a recurring problem.

swiatekm
swiatekm previously approved these changes Jun 20, 2025
Contributor

@swiatekm swiatekm left a comment

LGTM. It's unfortunate that we need to vendor all this code, and in an ideal world I'd prefer if these artifacts were cached locally on our CI runners instead. But this should work and probably won't be too burdensome to maintain.

Left some questions and nitpicks that shouldn't block merging.

@pkoutsovasilis pkoutsovasilis dismissed stale reviews from swiatekm and pchila via 963742e June 20, 2025 11:13
@pkoutsovasilis pkoutsovasilis merged commit 7259e54 into elastic:main Jun 23, 2025
16 of 19 checks passed
@github-actions
Contributor

@Mergifyio backport 8.17 8.18 8.19 9.0

@mergify
Contributor

mergify bot commented Jun 23, 2025

backport 8.17 8.18 8.19 9.0

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jun 23, 2025
* feat: vendor all necessary test artifacts for kubernetes integration to minimise flakiness due to transient errors

* fix: correct decode api key

* fix: clear CA_TRUSTED env var for kustomize

* fix: bump memory limits for kustomize

* fix: fabricate paths leveraging filepath

* fix: remove redundant file moving when downloading kube stack helm chart

* feat: vendor expanded archives

* fix: use filepath.Join

* doc: update BuildDependencies godoc

(cherry picked from commit 7259e54)

# Conflicts:
#	NOTICE-fips.txt
#	NOTICE.txt
#	go.mod
#	magefile.go
#	testing/integration/k8s/journald_test.go
#	testing/integration/k8s/kubernetes_agent_standalone_test.go
mergify bot pushed a commit that referenced this pull request Jun 23, 2025
* feat: vendor all necessary test artifacts for kubernetes integration to minimise flakiness due to transient errors

* fix: correct decode api key

* fix: clear CA_TRUSTED env var for kustomize

* fix: bump memory limits for kustomize

* fix: fabricate paths leveraging filepath

* fix: remove redundant file moving when downloading kube stack helm chart

* feat: vendor expanded archives

* fix: use filepath.Join

* doc: update BuildDependencies godoc

(cherry picked from commit 7259e54)

# Conflicts:
#	NOTICE-fips.txt
#	NOTICE.txt
#	go.mod
#	testing/integration/k8s/journald_test.go
#	testing/integration/k8s/kubernetes_agent_standalone_test.go
mergify bot pushed a commit that referenced this pull request Jun 23, 2025
* feat: vendor all necessary test artifacts for kubernetes integration to minimise flakiness due to transient errors

* fix: correct decode api key

* fix: clear CA_TRUSTED env var for kustomize

* fix: bump memory limits for kustomize

* fix: fabricate paths leveraging filepath

* fix: remove redundant file moving when downloading kube stack helm chart

* feat: vendor expanded archives

* fix: use filepath.Join

* doc: update BuildDependencies godoc

(cherry picked from commit 7259e54)

# Conflicts:
#	testing/integration/k8s/journald_test.go
#	testing/integration/k8s/kubernetes_agent_standalone_test.go
mergify bot pushed a commit that referenced this pull request Jun 23, 2025
* feat: vendor all necessary test artifacts for kubernetes integration to minimise flakiness due to transient errors

* fix: correct decode api key

* fix: clear CA_TRUSTED env var for kustomize

* fix: bump memory limits for kustomize

* fix: fabricate paths leveraging filepath

* fix: remove redundant file moving when downloading kube stack helm chart

* feat: vendor expanded archives

* fix: use filepath.Join

* doc: update BuildDependencies godoc

(cherry picked from commit 7259e54)

# Conflicts:
#	NOTICE-fips.txt
#	NOTICE.txt
#	go.mod
#	testing/integration/k8s/journald_test.go
pkoutsovasilis added a commit that referenced this pull request Jun 23, 2025
* [ci] fix k8s integration tests flakiness (#8575)

* feat: vendor all necessary test artifacts for kubernetes integration to minimise flakiness due to transient errors

* fix: correct decode api key

* fix: clear CA_TRUSTED env var for kustomize

* fix: bump memory limits for kustomize

* fix: fabricate paths leveraging filepath

* fix: remove redundant file moving when downloading kube stack helm chart

* feat: vendor expanded archives

* fix: use filepath.Join

* doc: update BuildDependencies godoc

(cherry picked from commit 7259e54)

# Conflicts:
#	testing/integration/k8s/journald_test.go
#	testing/integration/k8s/kubernetes_agent_standalone_test.go

* fix: resolve conflicts

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Jun 23, 2025
* [ci] fix k8s integration tests flakiness (#8575)

* feat: vendor all necessary test artifacts for kubernetes integration to minimise flakiness due to transient errors

* fix: correct decode api key

* fix: clear CA_TRUSTED env var for kustomize

* fix: bump memory limits for kustomize

* fix: fabricate paths leveraging filepath

* fix: remove redundant file moving when downloading kube stack helm chart

* feat: vendor expanded archives

* fix: use filepath.Join

* doc: update BuildDependencies godoc

(cherry picked from commit 7259e54)

# Conflicts:
#	NOTICE-fips.txt
#	NOTICE.txt
#	go.mod
#	magefile.go
#	testing/integration/k8s/journald_test.go
#	testing/integration/k8s/kubernetes_agent_standalone_test.go

* fix: resolve conflicts

* fix: rework CA_TRUSTED elimination

* fix: add ELASTIC_AGENT_OTEL in TestKubernetesAgentOtel

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Jun 23, 2025
* [ci] fix k8s integration tests flakiness (#8575)

* feat: vendor all necessary test artifacts for kubernetes integration to minimise flakiness due to transient errors

* fix: correct decode api key

* fix: clear CA_TRUSTED env var for kustomize

* fix: bump memory limits for kustomize

* fix: fabricate paths leveraging filepath

* fix: remove redundant file moving when downloading kube stack helm chart

* feat: vendor expanded archives

* fix: use filepath.Join

* doc: update BuildDependencies godoc

(cherry picked from commit 7259e54)

# Conflicts:
#	NOTICE-fips.txt
#	NOTICE.txt
#	go.mod
#	testing/integration/k8s/journald_test.go
#	testing/integration/k8s/kubernetes_agent_standalone_test.go

* fix: resolve conflicts

* fix: update NOTICE.txt

* fix: rework CA_TRUSTED elimination

* fix: add ELASTIC_AGENT_OTEL in TestKubernetesAgentOtel

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Jun 23, 2025
* [ci] fix k8s integration tests flakiness (#8575)

* feat: vendor all necessary test artifacts for kubernetes integration to minimise flakiness due to transient errors

* fix: correct decode api key

* fix: clear CA_TRUSTED env var for kustomize

* fix: bump memory limits for kustomize

* fix: fabricate paths leveraging filepath

* fix: remove redundant file moving when downloading kube stack helm chart

* feat: vendor expanded archives

* fix: use filepath.Join

* doc: update BuildDependencies godoc

(cherry picked from commit 7259e54)

# Conflicts:
#	NOTICE-fips.txt
#	NOTICE.txt
#	go.mod
#	testing/integration/k8s/journald_test.go

* fix: resolve conflicts

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
v1v added a commit that referenced this pull request Jun 25, 2025
…-hosted

* feature/hosted-stack-using-oblt-cli: (26 commits)
  Use the current official docker image for oblt-cli
  Mark the elasticinframetrics processor as deprecated and schedule for removal (#8659)
  [main][Automation] Update versions (#8668)
  chore: Update create_deployment_csp_configuration.yaml (#8669)
  Attempt to make test more reliable by querying ES directly (#8422)
  [test] split up ess and beats serverless integration tests (#8551)
  Remove resource/k8s processor and use k8sattributes processor for service attributes (#8599)
  fix: use --force-confold for deb tests in TestUpgradeAgentWithTamperProtectedEndpoint_DEB (#8649)
  [main][Automation] Bump stack images versions to 9.1.0-ea0b7542 (#8612)
  chore: Update to elastic/beats@f6594fb72670 (#8640)
  [deb/rpm] restart endpoint with tamper protection after elastic-agent  (#8637)
  ci: don't preinstall fleet packages on retried CI steps (#8636)
  chore: Update to elastic/beats@6b6941eed496 (#8619)
  [main][Automation] Bump VM Image version to 1750467641 (#8617)
  flaky: skip TestUpgradeAgentWithTamperProtectedEndpoint_RPM (#8626)
  Add skip-changelog PR label for bump VM PRs (#8627)
  build(deps): bump github.com/elastic/go-seccomp-bpf from 1.5.0 to 1.6.0 (#8611)
  [ci] fix k8s integration tests flakiness (#8575)
  bump apmconfig Otel extension to v0.3.0 (#8600)
  Enhancement/6394 allow deb rpm to upgrade with endpoint tamper protection (#6907)
  ...

Labels

backport-active-all (Automated backport with mergify to all the active branches), skip-changelog, Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Flaky Test]: TestKubernetesAgentStandaloneKustomize – failed to render kustomize

4 participants