Commit 9c8ed19

feat: remove CR meters when they are deleted (after a delay) (#1805)
* feat: remove CR meters when they are deleted (after a delay) Fixes #1803. Also added documentation for the Metrics feature.
1 parent 00da140 commit 9c8ed19

4 files changed

+285
-13
lines changed

docs/documentation/features.md

+45
@@ -757,6 +757,51 @@ with a `mycrs` plural form will result in 2 files:
 > Quarkus users using the `quarkus-operator-sdk` extension do not need to add any extra dependency
 > to get their CRD generated as this is handled by the extension itself.
 
+## Metrics
+
+JOSDK provides built-in support for reporting metrics about what your reconcilers are doing, in the form of the
+`Metrics` interface. You can implement this interface to connect to the metrics provider of your choice; JOSDK calls
+its methods as it reconciles resources. By default, a no-operation implementation is used, which gives you a sane
+default at no cost. A [Micrometer](https://micrometer.io)-based implementation is also provided.
+
+You can use a different implementation by overriding the default one provided by the default `ConfigurationService`, as
+follows:
+
+```java
+Metrics metrics = …;
+ConfigurationServiceProvider.overrideCurrent(overrider -> overrider.withMetrics(metrics));
+```
+
+### Micrometer implementation
+
+By default, the Micrometer implementation records many metrics for each resource handled by the operator.
+To stay efficient, it removes the meters associated with a resource when that resource is deleted. Since it
+can be useful to keep these metrics around for a while after the deletion, you can configure a delay, in seconds,
+before they are removed. Because the removal happens asynchronously, you can also configure how many threads to
+devote to it. Both settings are parameters of the `MicrometerMetrics` constructor, so changing the defaults
+is a matter of instantiating `MicrometerMetrics` with the desired values and telling `ConfigurationServiceProvider`
+about it, as shown above.
+
+The Micrometer implementation records the following metrics:
+
+| Meter name | Type | Tags | Description |
+|------------|------|------|-------------|
+| operator.sdk.reconciliations.executions.<reconciler name> | gauge | group, version, kind | Number of executions of the named reconciler |
+| operator.sdk.reconciliations.queue.size.<reconciler name> | gauge | group, version, kind | How many resources are queued to get reconciled by the named reconciler |
+| operator.sdk.<map name>.size | gauge map size | | Gauge tracking the size of a specified map (currently unused but could be used to monitor cache sizes) |
+| operator.sdk.events.received | counter | group, version, kind, name, namespace, scope, event, action | Number of received Kubernetes events |
+| operator.sdk.events.delete | counter | group, version, kind, name, namespace, scope | Number of received Kubernetes delete events |
+| operator.sdk.reconciliations.started | counter | group, version, kind, name, namespace, scope, reconciliations.retries.last, reconciliations.retries.number | Number of started reconciliations per resource type |
+| operator.sdk.reconciliations.failed | counter | group, version, kind, name, namespace, scope, exception | Number of failed reconciliations per resource type |
+| operator.sdk.reconciliations.success | counter | group, version, kind, name, namespace, scope | Number of successful reconciliations per resource type |
+| operator.sdk.controllers.execution.reconcile.success | counter | controller, type | Number of successful reconciliations per controller |
+| operator.sdk.controllers.execution.reconcile.failure | counter | controller, exception | Number of failed reconciliations per controller |
+| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
+| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |
+
+As you can see, all the recorded meter names start with the `operator.sdk` prefix.
+
+
 ## Optimizing Caches
 
 One of the ideas around the operator pattern is that all the relevant resources are cached, thus reconciliation is
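
The delayed meter removal described in the Micrometer section of the documentation change above can be illustrated with a small, dependency-free sketch. It is not the JOSDK implementation: plain strings stand in for `ResourceID` and `Meter.Id`, the class and method names (`DelayedMeterCleanupSketch`, `record`, `scheduleRemoval`) are hypothetical, the delay is expressed in milliseconds rather than seconds so the sketch runs quickly, and the daemon threads and concurrent key set are choices of this sketch. It mirrors the constructor logic shown in this commit: a negative delay removes meters immediately, and a non-positive thread count falls back to the number of available processors.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative only: String keys stand in for ResourceID and Meter.Id. */
class DelayedMeterCleanupSketch {
  private final Map<String, Set<String>> metersPerResource = new ConcurrentHashMap<>();
  private final long delayMillis;
  private final ScheduledExecutorService scheduler; // null when removal is immediate

  DelayedMeterCleanupSketch(long delayMillis, int threads) {
    this.delayMillis = delayMillis;
    if (delayMillis < 0) {
      // negative delay: meters are removed synchronously, no thread pool is needed
      scheduler = null;
    } else {
      // a non-positive thread count falls back to the number of available processors
      final int poolSize = threads <= 0 ? Runtime.getRuntime().availableProcessors() : threads;
      // daemon threads (a choice of this sketch) never keep the JVM alive
      scheduler = Executors.newScheduledThreadPool(poolSize, runnable -> {
        final Thread thread = new Thread(runnable, "meter-cleaner");
        thread.setDaemon(true);
        return thread;
      });
    }
  }

  /** Remembers that the given meter belongs to the given resource. */
  void record(String resource, String meter) {
    metersPerResource.computeIfAbsent(resource, r -> ConcurrentHashMap.newKeySet()).add(meter);
  }

  /** Returns the meters currently tracked for the resource, or an empty set. */
  Set<String> metersFor(String resource) {
    return metersPerResource.getOrDefault(resource, Set.of());
  }

  /** Forgets the resource's meters, either right away or after the configured delay. */
  Future<?> scheduleRemoval(String resource) {
    final Runnable removal = () -> metersPerResource.remove(resource);
    if (scheduler == null) {
      removal.run();
      return CompletableFuture.completedFuture(null);
    }
    return scheduler.schedule(removal, delayMillis, TimeUnit.MILLISECONDS);
  }

  void shutdown() {
    if (scheduler != null) {
      scheduler.shutdown();
    }
  }
}
```

A caller would record meters as they are created, call `scheduleRemoval` once the resource's cleanup is done, and use the returned `Future` only if it needs to know when the removal has happened.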

micrometer-support/pom.xml

+31
@@ -26,6 +26,37 @@
       <groupId>io.javaoperatorsdk</groupId>
       <artifactId>operator-framework-core</artifactId>
     </dependency>
+    <dependency>
+      <groupId>org.junit.jupiter</groupId>
+      <artifactId>junit-jupiter-api</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.junit.jupiter</groupId>
+      <artifactId>junit-jupiter-engine</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.assertj</groupId>
+      <artifactId>assertj-core</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.awaitility</groupId>
+      <artifactId>awaitility</artifactId>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>io.javaoperatorsdk</groupId>
+      <artifactId>operator-framework-junit-5</artifactId>
+      <version>${project.version}</version>
+      <scope>test</scope>
+    </dependency>
+    <dependency>
+      <groupId>io.fabric8</groupId>
+      <artifactId>kubernetes-httpclient-vertx</artifactId>
+      <scope>test</scope>
+    </dependency>
   </dependencies>
 
 </project>

micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetrics.java

+112 -13
@@ -1,11 +1,10 @@
 package io.javaoperatorsdk.operator.monitoring.micrometer;
 
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.List;
-import java.util.Map;
-import java.util.Optional;
+import java.util.*;
 import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicInteger;
 
 import io.fabric8.kubernetes.api.model.HasMetadata;
@@ -17,6 +16,8 @@
 import io.javaoperatorsdk.operator.processing.GroupVersionKind;
 import io.javaoperatorsdk.operator.processing.event.Event;
 import io.javaoperatorsdk.operator.processing.event.ResourceID;
+import io.javaoperatorsdk.operator.processing.event.source.controller.ResourceEvent;
+import io.micrometer.core.instrument.Meter;
 import io.micrometer.core.instrument.MeterRegistry;
 import io.micrometer.core.instrument.Tag;
 import io.micrometer.core.instrument.Timer;
@@ -31,9 +32,53 @@ public class MicrometerMetrics implements Metrics {
   private static final String RECONCILIATIONS_QUEUE_SIZE = PREFIX + RECONCILIATIONS + "queue.size.";
   private final MeterRegistry registry;
   private final Map<String, AtomicInteger> gauges = new ConcurrentHashMap<>();
+  private final Map<ResourceID, Set<Meter.Id>> metersPerResource = new ConcurrentHashMap<>();
+  private final Cleaner cleaner;
 
+  /**
+   * Creates a non-delayed, micrometer-based Metrics implementation. The non-delayed part refers to
+   * the cleaning of meters associated with deleted resources.
+   *
+   * @param registry the {@link MeterRegistry} instance to use for metrics recording
+   */
   public MicrometerMetrics(MeterRegistry registry) {
+    this(registry, 0);
+  }
+
+  /**
+   * Creates a micrometer-based Metrics implementation that delays cleaning up {@link Meter}s
+   * associated with deleted resources by the specified number of seconds, using a single thread
+   * for that process.
+   *
+   * @param registry the {@link MeterRegistry} instance to use for metrics recording
+   * @param cleanUpDelayInSeconds the number of seconds to wait before meters are removed for
+   *        deleted resources
+   */
+  public MicrometerMetrics(MeterRegistry registry, int cleanUpDelayInSeconds) {
+    this(registry, cleanUpDelayInSeconds, 1);
+  }
+
+  /**
+   * Creates a micrometer-based Metrics implementation that delays cleaning up {@link Meter}s
+   * associated with deleted resources by the specified number of seconds, using at most the
+   * specified number of threads for that process.
+   *
+   * @param registry the {@link MeterRegistry} instance to use for metrics recording
+   * @param cleanUpDelayInSeconds the number of seconds to wait before meters are removed for
+   *        deleted resources
+   * @param cleaningThreadsNumber the number of threads to use for the cleaning process
+   */
+  public MicrometerMetrics(MeterRegistry registry, int cleanUpDelayInSeconds,
+      int cleaningThreadsNumber) {
     this.registry = registry;
+    if (cleanUpDelayInSeconds < 0) {
+      cleaner = new NoDelayCleaner();
+    } else {
+      cleaningThreadsNumber =
+          cleaningThreadsNumber <= 0 ? Runtime.getRuntime().availableProcessors()
+              : cleaningThreadsNumber;
+      cleaner = new DelayedCleaner(cleanUpDelayInSeconds, cleaningThreadsNumber);
+    }
   }
 
   @Override
@@ -108,14 +153,24 @@ private static String getScope(ResourceID resourceID) {
 
   @Override
   public void receivedEvent(Event event, Map<String, Object> metadata) {
+    final String[] tags;
+    if (event instanceof ResourceEvent) {
+      tags = new String[] {"event", event.getClass().getSimpleName(), "action",
+          ((ResourceEvent) event).getAction().toString()};
+    } else {
+      tags = new String[] {"event", event.getClass().getSimpleName()};
+    }
+
     incrementCounter(event.getRelatedCustomResourceID(), "events.received",
         metadata,
-        "event", event.getClass().getSimpleName());
+        tags);
   }
 
   @Override
   public void cleanupDoneFor(ResourceID resourceID, Map<String, Object> metadata) {
     incrementCounter(resourceID, "events.delete", metadata);
+
+    cleaner.removeMetersFor(resourceID);
   }
 
   @Override
@@ -125,11 +180,11 @@ public void reconcileCustomResource(HasMetadata resource, RetryInfo retryInfoNul
     incrementCounter(ResourceID.fromResource(resource), RECONCILIATIONS + "started",
         metadata,
         RECONCILIATIONS + "retries.number",
-        "" + retryInfo.map(RetryInfo::getAttemptCount).orElse(0),
+        String.valueOf(retryInfo.map(RetryInfo::getAttemptCount).orElse(0)),
         RECONCILIATIONS + "retries.last",
-        "" + retryInfo.map(RetryInfo::isLastAttempt).orElse(true));
+        String.valueOf(retryInfo.map(RetryInfo::isLastAttempt).orElse(true)));
 
-    AtomicInteger controllerQueueSize =
+    var controllerQueueSize =
         gauges.get(RECONCILIATIONS_QUEUE_SIZE + metadata.get(CONTROLLER_NAME));
     controllerQueueSize.incrementAndGet();
   }
@@ -141,18 +196,18 @@ public void finishedReconciliation(HasMetadata resource, Map<String, Object> met
 
   @Override
   public void reconciliationExecutionStarted(HasMetadata resource, Map<String, Object> metadata) {
-    AtomicInteger reconcilerExecutions =
+    var reconcilerExecutions =
         gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
     reconcilerExecutions.incrementAndGet();
   }
 
   @Override
   public void reconciliationExecutionFinished(HasMetadata resource, Map<String, Object> metadata) {
-    AtomicInteger reconcilerExecutions =
+    var reconcilerExecutions =
         gauges.get(RECONCILIATIONS_EXECUTIONS + metadata.get(CONTROLLER_NAME));
     reconcilerExecutions.decrementAndGet();
 
-    AtomicInteger controllerQueueSize =
+    var controllerQueueSize =
         gauges.get(RECONCILIATIONS_QUEUE_SIZE + metadata.get(CONTROLLER_NAME));
     controllerQueueSize.decrementAndGet();
   }
@@ -202,6 +257,50 @@ private void incrementCounter(ResourceID id, String counterName, Map<String, Obj
           "version", gvk.version,
           "kind", gvk.kind));
     }
-    registry.counter(PREFIX + counterName, tags.toArray(new String[0])).increment();
+    final var counter = registry.counter(PREFIX + counterName, tags.toArray(new String[0]));
+    metersPerResource.computeIfAbsent(id, resourceID -> new HashSet<>()).add(counter.getId());
+    counter.increment();
+  }
+
+  protected Set<Meter.Id> recordedMeterIdsFor(ResourceID resourceID) {
+    return metersPerResource.get(resourceID);
+  }
+
+  private interface Cleaner {
+    void removeMetersFor(ResourceID resourceID);
+  }
+
+  private void removeMetersFor(ResourceID resourceID) {
+    // remove each meter
+    final var toClean = metersPerResource.get(resourceID);
+    if (toClean != null) {
+      toClean.forEach(registry::remove);
+    }
+    // then clean-up local recording of associations
+    metersPerResource.remove(resourceID);
+  }
+
+  private class NoDelayCleaner implements Cleaner {
+    @Override
+    public void removeMetersFor(ResourceID resourceID) {
+      MicrometerMetrics.this.removeMetersFor(resourceID);
+    }
+  }
+
+  private class DelayedCleaner implements Cleaner {
+    private final ScheduledExecutorService metersCleaner;
+    private final int cleanUpDelayInSeconds;
+
+    private DelayedCleaner(int cleanUpDelayInSeconds, int cleaningThreadsNumber) {
+      this.cleanUpDelayInSeconds = cleanUpDelayInSeconds;
+      this.metersCleaner = Executors.newScheduledThreadPool(cleaningThreadsNumber);
+    }
+
+    @Override
+    public void removeMetersFor(ResourceID resourceID) {
+      // schedule deletion of meters associated with ResourceID
+      metersCleaner.schedule(() -> MicrometerMetrics.this.removeMetersFor(resourceID),
+          cleanUpDelayInSeconds, TimeUnit.SECONDS);
+    }
   }
 }
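
The `receivedEvent` change above adds an `action` tag only when the event is a `ResourceEvent`, which is why the documentation table lists both `event` and `action` for `operator.sdk.events.received`. A minimal sketch of that branching, assuming hypothetical stand-in types (`SketchEvent`, `SketchResourceEvent`) rather than the real JOSDK classes and a plain string for the action, could look like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a generic event; not the JOSDK Event class. */
class SketchEvent {
}

/** Hypothetical stand-in for an event carrying an action, loosely modelled on ResourceEvent. */
class SketchResourceEvent extends SketchEvent {
  private final String action;

  SketchResourceEvent(String action) {
    this.action = action;
  }

  String getAction() {
    return action;
  }
}

class EventTagsSketch {
  /** Builds a flat key/value tag list: "event" always, "action" only for resource events. */
  static List<String> tagsFor(SketchEvent event) {
    final List<String> tags = new ArrayList<>();
    tags.add("event");
    tags.add(event.getClass().getSimpleName());
    if (event instanceof SketchResourceEvent) {
      tags.add("action");
      tags.add(((SketchResourceEvent) event).getAction());
    }
    return tags;
  }
}
```

Building the tags as a flat key/value sequence matches the varargs style used by `incrementCounter` in the diff; the list here is only a convenience for the sketch.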
@@ -0,0 +1,97 @@
+package io.javaoperatorsdk.operator.monitoring.micrometer;
+
+import java.time.Duration;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.extension.RegisterExtension;
+
+import io.fabric8.kubernetes.api.model.ConfigMap;
+import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
+import io.javaoperatorsdk.operator.api.config.ConfigurationServiceProvider;
+import io.javaoperatorsdk.operator.api.reconciler.*;
+import io.javaoperatorsdk.operator.junit.LocallyRunOperatorExtension;
+import io.javaoperatorsdk.operator.processing.event.ResourceID;
+import io.micrometer.core.instrument.Meter;
+import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.awaitility.Awaitility.await;
+
+public class MetricsCleaningOnDeleteIT {
+  @RegisterExtension
+  static LocallyRunOperatorExtension operator =
+      LocallyRunOperatorExtension.builder().withReconciler(new MetricsCleaningTestReconciler())
+          .build();
+
+  private static final TestSimpleMeterRegistry registry = new TestSimpleMeterRegistry();
+  private static final int testDelay = 1;
+  private static final MicrometerMetrics metrics = new MicrometerMetrics(registry, testDelay, 2);
+  private static final String testResourceName = "cleaning-metrics-cr";
+
+  @BeforeAll
+  static void setup() {
+    ConfigurationServiceProvider.overrideCurrent(overrider -> overrider.withMetrics(metrics));
+  }
+
+  @AfterAll
+  static void reset() {
+    ConfigurationServiceProvider.reset();
+  }
+
+  @Test
+  void removesMetersAssociatedWithResourceAfterItsDeletion() throws InterruptedException {
+    var testResource = new ConfigMapBuilder()
+        .withNewMetadata()
+        .withName(testResourceName)
+        .endMetadata()
+        .build();
+    final var created = operator.create(testResource);
+
+    // make sure the resource is created
+    await().until(() -> !operator.get(ConfigMap.class, testResourceName)
+        .getMetadata().getFinalizers().isEmpty());
+
+    // check that we properly recorded meters associated with the resource
+    final var meters = metrics.recordedMeterIdsFor(ResourceID.fromResource(created));
+    assertThat(meters).isNotNull();
+    assertThat(meters).isNotEmpty();
+
+    // delete the resource and wait for it to be deleted
+    operator.delete(testResource);
+    await().until(() -> operator.get(ConfigMap.class, testResourceName) == null);
+
+    // check that the meters are properly removed after the specified delay
+    Thread.sleep(Duration.ofSeconds(testDelay).toMillis());
+    assertThat(registry.removed).isEqualTo(meters);
+    assertThat(metrics.recordedMeterIdsFor(ResourceID.fromResource(created))).isNull();
+  }
+
+  @ControllerConfiguration
+  private static class MetricsCleaningTestReconciler
+      implements Reconciler<ConfigMap>, Cleaner<ConfigMap> {
+    @Override
+    public UpdateControl<ConfigMap> reconcile(ConfigMap resource, Context<ConfigMap> context) {
+      return UpdateControl.noUpdate();
+    }
+
+    @Override
+    public DeleteControl cleanup(ConfigMap resource, Context<ConfigMap> context) {
+      return DeleteControl.defaultDelete();
+    }
+  }
+
+  private static class TestSimpleMeterRegistry extends SimpleMeterRegistry {
+    private final Set<Meter.Id> removed = new HashSet<>();
+
+    @Override
+    public Meter remove(Meter.Id mappedId) {
+      final var removed = super.remove(mappedId);
+      this.removed.add(removed.getId());
+      return removed;
+    }
+  }
+}
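
The test above waits with a fixed `Thread.sleep` equal to the clean-up delay before asserting, which may be sensitive to scheduling jitter. One possible alternative is to poll for the expected state with a timeout. In the test itself, Awaitility's `await()` (already imported) would be the natural tool; the helper below is only a dependency-free sketch of the same idea, and `PollingSketch.waitUntil` is a hypothetical name, not part of JOSDK or Awaitility.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

class PollingSketch {
  /**
   * Re-checks the condition at the given interval until it holds or the timeout elapses.
   * Returns true if the condition was met, false on timeout or interruption.
   */
  static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out without the condition ever holding
      }
      try {
        Thread.sleep(interval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
  }
}
```

With such a helper, the assertion on removed meters would run only once the condition holds, or fail after a generous timeout, instead of depending on the removal completing within exactly the configured delay.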
