
[observability] most basic OpenTelemetry integration into MCK #93


Open
wants to merge 6 commits into
base: master

nammn
Collaborator

@nammn nammn commented May 7, 2025

Summary

This pull request introduces OpenTelemetry tracing support to the MongoDB Kubernetes Operator and its related components. Key changes include the integration of OpenTelemetry libraries, the addition of tracing configuration, and updates to ensure trace propagation across the application. These changes enhance observability and debugging capabilities.

In our CI suite this means we will have the following kind of traces:

trace_id: abc123

                         ┌────────────────────┐
                         │     Evergreen      │
                         │  span_id: ROOT     │
                         │  parent_id: none   │
                         └─────────┬──────────┘
                                   │
        ┌──────────────────────────┼─────────────────────────┐
        ▼                          ▼                         ▼
┌──────────────┐         ┌────────────────┐         ┌────────────────────┐
│   E2E Test   │         │   Operator     │         │     (Other…)       │
│ span_id: A1  │         │ span_id: B1    │         │                    │
│ parent: ROOT │         │ parent: ROOT   │         │                    │
└──────┬───────┘         └──────┬─────────┘         └────────────────────┘
       │                        │
       ▼                        ▼
┌──────────────┐         ┌────────────────────┐
│ E2E Function │         │   Reconcile Loop   │
│ span_id: A2  │         │   span_id: B2      │
│ parent: A1   │         │   parent: B1       │
└──────────────┘         └────────────────────┘

OpenTelemetry Integration:

  • Tracing in main.go:
    • Added OpenTelemetry setup in the main function, including trace and span ID extraction from environment variables and the creation of a root span for the operator. Tracing context is propagated across controllers and shutdown processes are handled gracefully.
  • Telemetry in pkg/telemetry/client.go (good to have in case a change ever accidentally sends events to prod Atlas):
    • Added a span to the SendEventWithRetry function to capture telemetry events and include the Atlas base URL as a span attribute.
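
The environment-variable handoff described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `parentFromEnv`, `validHexID`, and `traceparent` are hypothetical helper names, and only the variable names OTEL_TRACE_ID and OTEL_PARENT_ID come from the PR. The ID sizes (16-byte trace ID, 8-byte span ID, hex encoded, never all zero) and the header layout follow the W3C Trace Context format.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"strings"
)

// parentIDs holds the identifiers handed to the operator by the CI run.
type parentIDs struct {
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters
}

// validHexID reports whether s decodes to exactly n bytes that are not all zero.
func validHexID(s string, n int) bool {
	b, err := hex.DecodeString(s)
	if err != nil || len(b) != n {
		return false
	}
	for _, v := range b {
		if v != 0 {
			return true
		}
	}
	return false // all-zero IDs are invalid in W3C Trace Context
}

// parentFromEnv reads the IDs through the supplied lookup function
// (os.Getenv in a real process) so the logic can be tested in isolation.
func parentFromEnv(getenv func(string) string) (parentIDs, error) {
	p := parentIDs{
		TraceID: strings.ToLower(getenv("OTEL_TRACE_ID")),
		SpanID:  strings.ToLower(getenv("OTEL_PARENT_ID")),
	}
	if !validHexID(p.TraceID, 16) {
		return parentIDs{}, errors.New("OTEL_TRACE_ID must be 32 hex characters and non-zero")
	}
	if !validHexID(p.SpanID, 8) {
		return parentIDs{}, errors.New("OTEL_PARENT_ID must be 16 hex characters and non-zero")
	}
	return p, nil
}

// traceparent renders the IDs as a W3C traceparent value with the sampled flag set.
func traceparent(p parentIDs) string {
	return fmt.Sprintf("00-%s-%s-01", p.TraceID, p.SpanID)
}

func main() {
	p, err := parentFromEnv(os.Getenv)
	if err != nil {
		// Without a valid parent, the process would start its own trace.
		fmt.Println("no usable parent context:", err)
		return
	}
	fmt.Println(traceparent(p))
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the validation testable without mutating the process environment.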

Helm Chart Updates:

  • Operator configuration:
    • Added OpenTelemetry-specific environment variables (OTEL_TRACE_ID, OTEL_PARENT_ID, OTEL_EXPORTER_OTLP_ENDPOINT) to the operator's deployment template. ([helm_chart/templates/operator.yamlR83-R90](https://github.com/mongodb/mongodb-kubernetes/pull/93/files#diff-5d2e377a6806023ca9eff60be4d7e5cd879803de2bd3800b630f479f8728f322R83-R90))
    • Introduced OpenTelemetry configuration options (enabled, traceID, parentID, collectorEndpoint) in the Helm chart's values.yaml.
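
The conditional injection of these variables can be approximated with Go's text/template package, which Helm templates build on. This is only a sketch under assumptions: the template text, the `tracing` struct, and `renderEnv` are illustrative, and the real template in helm_chart/templates/operator.yaml may be structured differently.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// tracing mirrors the four options named in the PR description.
type tracing struct {
	Enabled           bool
	TraceID           string
	ParentID          string
	CollectorEndpoint string
}

// envTemplate emits the OTEL variables only when tracing is enabled.
const envTemplate = `{{- if .Enabled }}
- name: OTEL_TRACE_ID
  value: {{ printf "%q" .TraceID }}
- name: OTEL_PARENT_ID
  value: {{ printf "%q" .ParentID }}
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: {{ printf "%q" .CollectorEndpoint }}
{{- end }}`

// renderEnv executes the template against the given settings.
func renderEnv(t tracing) (string, error) {
	tmpl, err := template.New("env").Parse(envTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, t); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderEnv(tracing{
		Enabled:           true,
		TraceID:           "your-trace-id",
		ParentID:          "your-parent-id",
		CollectorEndpoint: "http://collector:4318", // hypothetical endpoint
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

With `Enabled` left false the rendered output is empty, so a default install would carry no OTEL variables.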

Dependency Updates:

  • Go module dependencies:
    • Added OpenTelemetry-related libraries (otel, otel/sdk, otel/trace, etc.) to go.mod.

Proof of Work

  • e.g. patch

  • generated traces in our ci: Link
    Screenshot 2025-05-21 at 15 19 20

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you checked for release_note changes?

Reminder (Please remove this when merging)

  • Please try to approve or reject the PR promptly; keep PRs in review for as short a time as possible
  • Our Short Guide for PRs: Link
  • Remember the following Communication Standards - use comment prefixes for clarity:
    • blocking: Must be addressed before approval.
    • follow-up: Can be addressed in a later PR or ticket.
    • q: Clarifying question.
    • nit: Non-blocking suggestions.
    • note: Side-note, non-actionable. Example: Praise
    • --> no prefix is considered a question

@nammn nammn changed the title add initial operator tracing support OpenTelemetry integration into MCK May 7, 2025
@nammn nammn force-pushed the traces-operator branch from 19334f0 to 388f4ad Compare May 21, 2025 08:00
@nammn nammn changed the title OpenTelemetry integration into MCK most basic OpenTelemetry integration into MCK May 21, 2025
@nammn nammn changed the title most basic OpenTelemetry integration into MCK [observability] most basic OpenTelemetry integration into MCK May 21, 2025
@nammn nammn force-pushed the traces-operator branch 3 times, most recently from 1b6922c to 3420830 Compare May 21, 2025 08:45
  - name: OTEL_SERVICE_NAME
-   value: evergreen-agent
+   value: mongodb-e2e-tests
Collaborator Author

updated the service-name for better querying

reset_namespace "$(kubectl config current-context)" "${NAMESPACE}" || true
fi
# If the test passed, then the namespace is removed
delete_operator "${NAMESPACE}"
Collaborator Author

we should remove the operator at the end of the test run to exercise teardown; it also lets us ensure traces are exported in time

@nammn nammn force-pushed the traces-operator branch from 2391b87 to b75faef Compare May 21, 2025 13:22
@nammn nammn marked this pull request as ready for review May 21, 2025 13:23
@nammn nammn requested a review from a team as a code owner May 21, 2025 13:23
Copilot

This comment was marked as outdated.

Contributor

@lsierant lsierant left a comment

This is a very, very nice change and I would like to have it included. But as of right now I think we should discuss this PR a bit more before deciding to merge. There are too many end-user implications to just LGTM it now.

@nammn nammn force-pushed the traces-operator branch from a3c5077 to 1441f82 Compare May 23, 2025 13:01
helm uninstall --kube-context="${context}" mongodb-enterprise-operator || true &
helm uninstall --kube-context="${context}" mongodb-community-operator || true &
helm uninstall --kube-context="${context}" mongodb-enterprise-operator-multi-cluster || true &
helm uninstall --kube-context="${context}" mongodb-kubernetes-operator || true &
Collaborator Author

we have one operator now - this is a cleanup

@@ -162,7 +160,7 @@ reset_namespace() {
# a while to delete it.
should_wait="false"
# shellcheck disable=SC2153
-if [[ ${CURRENT_VARIANT_CONTEXT} == e2e_mdb_openshift_ubi_cloudqa || ${CURRENT_VARIANT_CONTEXT} == e2e_openshift_static_mdb_ubi_cloudqa ]]; then
+if [[ ${KUBE_ENVIRONMENT_NAME} == "openshift" ]]; then
Collaborator Author

cleanup

@@ -35,11 +35,10 @@ EOF

delete_operator() {
local ns="$1"
-local name=${OPERATOR_NAME:=mongodb-enterprise-operator}
+local name=${OPERATOR_NAME:=mongodb-kubernetes-operator}
Collaborator Author

this was not running before and therefore we had no flushing


title "Removing the Operator deployment ${name}"
-! kubectl --namespace "${ns}" get deployments | grep -q "${name}" \
-|| kubectl delete deployment "${name}" -n "${ns}" || true
+kubectl delete deployment "${name}" -n "${ns}" --wait=true --timeout=10s|| true
Collaborator Author

let's wait until everything is flushed before we stop. PoW is in the PR description - you can see the trace of the operator

@@ -83,6 +83,16 @@ spec:
valueFrom:
fieldRef:
fieldPath: metadata.namespace
{{- $opentelemetry := default dict .Values.operator.opentelemetry }}
Collaborator Author

pow:

(venv) ~/projects/ops-manager-kubernetes git:[traces-operator]
helm template helm_chart | rg OTEL
(venv) ~/projects/ops-manager-kubernetes git:[traces-operator]
helm template --set operator.opentelemetry.tracing.enabled=true --set operator.opentelemetry.tracing.traceID=your-trace-id --set operator.opentelemetry.tracing.parentID=your-parent-id --set operator.opentelemetry.tracing.collectorEndpoint=http://jaeger:14268/api/traces helm_chart | rg OTEL
            - name: OTEL_TRACE_ID
            - name: OTEL_PARENT_ID
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
(venv) ~/projects/ops-manager-kubernetes git:[traces-operator]
helm template --set operator.opentelemetry.tracing.enabled=true --set operator.opentelemetry.tracing.traceID=your-trace-id --set operator.opentelemetry.tracing.parentID=your-parent-id --set operator.opentelemetry.tracing.collectorEndpoint=http://jaeger:14268/api/traces helm_chart | rg OTEL -C 2
                fieldRef:
                  fieldPath: metadata.namespace
            - name: OTEL_TRACE_ID
              value: "your-trace-id"
            - name: OTEL_PARENT_ID
              value: "your-parent-id"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://jaeger:14268/api/traces"
            - name: WATCH_NAMESPACE

@nammn nammn requested review from lsierant and Copilot July 11, 2025 09:55
@Copilot Copilot AI left a comment

Pull Request Overview

Introduces basic OpenTelemetry tracing into the MongoDB Kubernetes Operator (MCK) and related scripts, enabling trace propagation from CI tests through operator spans.

  • Initializes tracing in main.go using OTEL env vars and creates a root operator span.
  • Instruments core telemetry functions (trace.go, configmap.go, collector.go, client.go) with spans and attributes.
  • Propagates OTEL settings through Helm charts, shell scripts, and test configurations.

Reviewed Changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/funcs/operator_deployment | Adds parsing of OTEL env vars to Helm values |
| scripts/funcs/kubernetes | Updates default operator name and cleanup/uninstall logic |
| scripts/evergreen/e2e/e2e.sh | Refactors cluster diagnostics and OpenShift cleanup sequences |
| scripts/evergreen/deployments/test-app/templates/mongodb-enterprise-tests.yaml | Sets OTEL_SERVICE_NAME and pytest --trace-parent flags |
| pkg/telemetry/trace.go | Implements SetupTracingFromParent with OTLP exporter |
| pkg/telemetry/configmap.go | Wraps ConfigMap creation in a tracing span |
| pkg/telemetry/collector.go | Adds a span around RunTelemetry |
| pkg/telemetry/client.go | Adds a span in SendEventWithRetry with Atlas base URL |
| pipeline.py | Clarifies trace_flags comment |
| main.go | Hooks up tracing setup, root span, and tracer shutdown |
| helm_chart/templates/operator.yaml | Injects OTEL env vars into the operator deployment |
| go.mod | Adds OpenTelemetry dependencies |
| docker/mongodb-kubernetes-tests/tests/conftest.py | Uses logger.debug instead of print and fixes a typo |
| LICENSE-THIRD-PARTY | Updates third-party license entries for new dependencies |
Comments suppressed due to low confidence (1)

pkg/telemetry/client.go:118

  • No unit tests cover the new OpenTelemetry instrumentation in SendEventWithRetry; consider adding tests to validate span creation and attribute setting.
	_, span := TRACER.Start(ctx, "SendEventWithRetry")

@@ -82,6 +83,13 @@ func updateConfigMapWithNewUUID(ctx context.Context, k8sClient kubeclient.Client

// Creates a new ConfigMap with a generated UUID
func createNewConfigMap(ctx context.Context, k8sClient kubeclient.Client, namespace string) string {
_, span := TRACER.Start(ctx, "createNewConfigMap")
Copilot AI Jul 11, 2025

[nitpick] Consider calling span.RecordError(err) inside the error branch of the ConfigMap creation to capture failures in the trace.


@@ -149,7 +149,7 @@ fi
dump_cluster_information

# We only have static clusters in OpenShift; otherwise, there's no need to mark and clean them up here.
-if [[ "${CLUSTER_TYPE}" == "openshift" ]]; then
+if [[ "${KUBE_ENVIRONMENT_NAME}" == *openshift* ]]; then
Collaborator Author

@nammn nammn Jul 11, 2025

KUBE_ENVIRONMENT_NAME="dev-openshift-cluster"

if [[ "${KUBE_ENVIRONMENT_NAME}" == *openshift* ]]; then
  echo "Contains openshift"
else
  echo "Does NOT contain openshift"
fi

-> Contains openshift

@@ -69,9 +69,6 @@ get_operator_helm_values() {
comma_separated_list="$(echo "${MEMBER_CLUSTERS}" | tr ' ' ',')"
# shellcheck disable=SC2154
config+=("multiCluster.clusters={${comma_separated_list}}")
fi

if [[ "${KUBE_ENVIRONMENT_NAME:-}" == "multi" ]]; then
Collaborator Author

for some reason it was duplicated

Contributor

@lsierant lsierant left a comment

Thanks for the changes! LGTM!

@nammn nammn enabled auto-merge (squash) July 11, 2025 11:43
@nammn nammn disabled auto-merge July 11, 2025 11:43
@nammn nammn enabled auto-merge (squash) July 11, 2025 12:14
@nammn nammn requested review from m1kola and anandsyncs July 11, 2025 12:14