Commit 0b208a2

metacosm, csviri, and scrocquesel authored
feat: now possible to only output non-resource related metrics (#1823)
* feat: bounded cache for informers (#1718)
* fix: typo caffein -> caffeine (#1795)
* feat: now possible to only output non-resource related metrics
  Fixes #1812.
* refactor: extract abstract test fixture to add tests with variations
* fix: add missing annotation
* tests: add more test variations
* fix: make operator non-static so it's registered once per test subclass
* feat: introduce builder for MicrometerMetrics, fix test
* fix: exclude more tags when not collecting per resource
* fix: registry should be per-instance to ensure test independence
* fix: make sure we wait a little to ensure event is properly processed
* fix: make things work on Java 11, format
* fix: also clean metrics on finalizer removal
  This is needed because the finalizer will trigger a reconciliation that adds a resource-specific metric.
* fix: format
* refactor: extract common tags
  Co-authored-by: Sébastien CROCQUESEL <[email protected]>
* feat: make per-resource collecting finer-grained
  We now still collect GVK information when per-resource collection is switched off.
* fix: do not create tag for group if not present
* fix: remove unreliable no-delay implementation, defaulting to 1s delay
* refactor: renamed & documented factory methods to make things clearer
* docs: updated metrics section for code changes
* feat: avoid emitting tag on empty value
* docs: update
* fix: format [skip ci]
* refactor: use Tag more directly, avoid unneeded work, use constants
* fix: change will happen instead of might
* docs: add missing timer
  Co-authored-by: Sébastien CROCQUESEL <[email protected]>
* docs: fix wrong & missing information
* refactor: add constants
* fix: wording [skip ci]
  Co-authored-by: Attila Mészáros <[email protected]>

---------

Co-authored-by: Attila Mészáros <[email protected]>
Co-authored-by: Sébastien CROCQUESEL <[email protected]>
1 parent 2b0280f commit 0b208a2

10 files changed: +452 -180 lines changed

docs/documentation/features.md

+53 -24
@@ -774,33 +774,62 @@ ConfigurationServiceProvider.overrideCurrent(overrider->overrider.withMetrics(me
 
 ### Micrometer implementation
 
-The micrometer implementation records a lot of metrics associated to each resource handled by the operator by default.
-In order to be efficient, the implementation removes meters associated with resources when they are deleted. Since it
-might be useful to keep these metrics around for a bit before they are deleted, it is possible to configure a delay
-before their removal. As this is done asynchronously, it is also possible to configure how many threads you want to
-devote to these operations. Both aspects are controlled by the `MicrometerMetrics` constructor so changing the defaults
-is a matter of instantiating `MicrometerMetrics` with the desired values and tell `ConfigurationServiceProvider` about
-it as shown above.
+The micrometer implementation is typically created using one of the provided factory methods which, depending on which
+is used, will return either a ready-to-use instance or a builder allowing users to customize how the implementation
+behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect
+metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior but
+this will change in a future version of JOSDK because this dramatically increases the cardinality of metrics, which
+could lead to performance issues.
 
-The micrometer implementation records the following metrics:
+To create a `MicrometerMetrics` implementation that behaves how it has historically behaved, you can just create an
+instance via:
+
+```java
+MeterRegistry registry = …;
+Metrics metrics = new MicrometerMetrics(registry);
+```
+
+Note, however, that this constructor is deprecated and we encourage you to use the factory methods instead, which either
+return a fully pre-configured instance or a builder object that will allow you to configure more easily how the instance
+will behave. You can, for example, configure whether or not the implementation should collect metrics on a per-resource
+basis, whether or not associated meters should be removed when a resource is deleted and how the clean-up is performed.
+See the relevant classes' documentation for more details.
 
-| Meter name | Type | Tags | Description |
-|-----------------------------------------------------------|----------------|------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
-| operator.sdk.reconciliations.executions.<reconciler name> | gauge | group, version, kind | Number of executions of the named reconciler |
-| operator.sdk.reconciliations.queue.size.<reconciler name> | gauge | group, version, kind | How many resources are queued to get reconciled by named reconciler |
-| operator.sdk.<map name>.size | gauge map size | | Gauge tracking the size of a specified map (currently unused but could be used to monitor caches size) |
-| operator.sdk.events.received | counter | group, version, kind, name, namespace, scope, event, action | Number of received Kubernetes events |
-| operator.sdk.events.delete | counter | group, version, kind, name, namespace, scope | Number of received Kubernetes delete events |
-| operator.sdk.reconciliations.started | counter | group, version, kind, name, namespace, scope, reconciliations.retries.last, reconciliations.retries.number | Number of started reconciliations per resource type |
-| operator.sdk.reconciliations.failed | counter | group, version, kind, name, namespace, scope, exception | Number of failed reconciliations per resource type |
-| operator.sdk.reconciliations.success | counter | group, version, kind, name, namespace, scope | Number of successful reconciliations per resource type |
-| operator.sdk.controllers.execution.reconcile.success | counter | controller, type | Number of successful reconciliations per controller |
-| operator.sdk.controllers.execution.reconcile.failure | counter | controller, exception | Number of failed reconciliations per controller |
-| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
-| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |
-
-As you can see all the recorded metrics start with the `operator.sdk` prefix.
+For example, the following will create a `MicrometerMetrics` instance configured to collect metrics on a per-resource
+basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.
+
+```java
+MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
+        .withCleanUpDelayInSeconds(5)
+        .withCleaningThreadNumber(2)
+        .build();
+```
+
+The micrometer implementation records the following metrics:
 
+| Meter name | Type | Tag names | Description |
+|-----------------------------------------------------------|----------------|-----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
+| operator.sdk.reconciliations.executions.<reconciler name> | gauge | group, version, kind | Number of executions of the named reconciler |
+| operator.sdk.reconciliations.queue.size.<reconciler name> | gauge | group, version, kind | How many resources are queued to get reconciled by named reconciler |
+| operator.sdk.<map name>.size | gauge map size | | Gauge tracking the size of a specified map (currently unused but could be used to monitor caches size) |
+| operator.sdk.events.received | counter | <resource metadata>, event, action | Number of received Kubernetes events |
+| operator.sdk.events.delete | counter | <resource metadata> | Number of received Kubernetes delete events |
+| operator.sdk.reconciliations.started | counter | <resource metadata>, reconciliations.retries.last, reconciliations.retries.number | Number of started reconciliations per resource type |
+| operator.sdk.reconciliations.failed | counter | <resource metadata>, exception | Number of failed reconciliations per resource type |
+| operator.sdk.reconciliations.success | counter | <resource metadata> | Number of successful reconciliations per resource type |
+| operator.sdk.controllers.execution.reconcile | timer | <resource metadata>, controller | Time taken for reconciliations per controller |
+| operator.sdk.controllers.execution.cleanup | timer | <resource metadata>, controller | Time taken for cleanups per controller |
+| operator.sdk.controllers.execution.reconcile.success | counter | controller, type | Number of successful reconciliations per controller |
+| operator.sdk.controllers.execution.reconcile.failure | counter | controller, exception | Number of failed reconciliations per controller |
+| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
+| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |
+
+As you can see, all the recorded metrics start with the `operator.sdk` prefix. `<resource metadata>`, in the table above,
+refers to resource-specific metadata and depends on the considered metric and how the implementation is configured and
+could be summed up as follows: `group?, version, kind, [name, namespace?], scope` where the tags in square
+brackets (`[]`) won't be present when per-resource collection is disabled and tags followed by a question mark are
+omitted if the associated value is empty. Of note, when in the context of controllers' execution metrics, these tag
+names are prefixed with `resource.`. This prefix might be removed in a future version for greater consistency.
 
 ## Optimizing Caches

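The `<resource metadata>` rule stated in the updated documentation (`group?, version, kind, [name, namespace?], scope`) can be illustrated with a small, self-contained sketch. This is not JOSDK code: the `ResourceTagNames` class and its `tagNames` method are hypothetical names introduced here, and applying the `resource.` prefix to every tag name, including `scope`, is an assumption drawn from the wording of the paragraph rather than from the implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not part of JOSDK or Micrometer.
public class ResourceTagNames {

    /**
     * Returns the tag names expected for a resource, following the documented rule
     * group?, version, kind, [name, namespace?], scope.
     *
     * @param group               API group; the tag is omitted when null or empty
     * @param namespace           namespace; the tag is omitted when null or empty
     * @param perResource         whether per-resource collection is enabled
     * @param controllerExecution whether the tags belong to a controller execution metric,
     *                            in which case names are assumed to be prefixed with "resource."
     */
    static List<String> tagNames(String group, String namespace,
                                 boolean perResource, boolean controllerExecution) {
        String prefix = controllerExecution ? "resource." : "";
        List<String> names = new ArrayList<>();
        if (group != null && !group.isEmpty()) {
            names.add(prefix + "group");
        }
        names.add(prefix + "version");
        names.add(prefix + "kind");
        if (perResource) {
            // name and namespace only appear when collecting per resource
            names.add(prefix + "name");
            if (namespace != null && !namespace.isEmpty()) {
                names.add(prefix + "namespace");
            }
        }
        names.add(prefix + "scope");
        return names;
    }

    public static void main(String[] args) {
        // Namespaced resource with a group, per-resource collection enabled
        System.out.println(tagNames("apps", "default", true, false));
        // prints [group, version, kind, name, namespace, scope]

        // Core-group resource, per-resource collection disabled
        System.out.println(tagNames("", "default", false, false));
        // prints [version, kind, scope]
    }
}
```

With per-resource collection disabled, the number of distinct tag combinations no longer grows with the number of resources, which is the cardinality concern the documentation raises.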