feat: now possible to only output non-resource related metrics #1823

Merged
merged 29 commits into from
Mar 27, 2023
29 commits
4e5794f
feat: bounded cache for informers (#1718)
csviri Feb 21, 2023
6c5fafa
fix: typo caffein -> caffeine (#1795)
metacosm Mar 3, 2023
0ae01d1
feat: now possible to only output non-resource related metrics
metacosm Mar 15, 2023
f761dbc
refactor: extract abstract test fixture to add tests with variations
metacosm Mar 16, 2023
0d75b87
fix: add missing annotation
metacosm Mar 16, 2023
805fbe0
tests: add more test variations
metacosm Mar 16, 2023
76d25f9
fix: make operator non-static so it's registered once per test subclass
metacosm Mar 16, 2023
dd2d360
feat: introduce builder for MicrometerMetrics, fix test
metacosm Mar 16, 2023
57cf246
fix: exclude more tags when not collecting per resource
metacosm Mar 16, 2023
b51ea84
fix: registry should be per-instance to ensure test independence
metacosm Mar 17, 2023
9b14151
fix: make sure we wait a little to ensure event is properly processed
metacosm Mar 17, 2023
54ddeec
fix: make things work on Java 11, format
metacosm Mar 17, 2023
feb8b06
fix: also clean metrics on finalizer removal
metacosm Mar 17, 2023
edc9530
fix: format
metacosm Mar 17, 2023
6d14663
refactor: extract common tags
metacosm Mar 20, 2023
916d849
feat: make per-resource collecting finer-grained
metacosm Mar 20, 2023
bc5b5f4
fix: do not create tag for group if not present
metacosm Mar 20, 2023
445c891
fix: remove unreliable no-delay implementation, defaulting to 1s delay
metacosm Mar 22, 2023
7565e1b
refactor: renamed & documented factory methods to make things clearer
metacosm Mar 23, 2023
e22dc75
docs: updated metrics section for code changes
metacosm Mar 23, 2023
9c8d77e
feat: avoid emitting tag on empty value
metacosm Mar 23, 2023
2624f7b
docs: update
metacosm Mar 23, 2023
3fd613e
fix: format
metacosm Mar 23, 2023
ee88028
refactor: use Tag more directly, avoid unneeded work, use constants
metacosm Mar 24, 2023
40441f0
fix: change will happen instead of might
metacosm Mar 24, 2023
3bf2045
docs: add missing timer
metacosm Mar 24, 2023
70462a8
docs: fix wrong & missing information
metacosm Mar 24, 2023
4505761
refactor: add constants
metacosm Mar 24, 2023
31d1326
fix: wording
metacosm Mar 27, 2023
77 changes: 53 additions & 24 deletions docs/documentation/features.md
@@ -774,33 +774,62 @@ ConfigurationServiceProvider.overrideCurrent(overrider->overrider.withMetrics(me

### Micrometer implementation

The micrometer implementation records a lot of metrics associated to each resource handled by the operator by default.
In order to be efficient, the implementation removes meters associated with resources when they are deleted. Since it
might be useful to keep these metrics around for a bit before they are deleted, it is possible to configure a delay
before their removal. As this is done asynchronously, it is also possible to configure how many threads you want to
devote to these operations. Both aspects are controlled by the `MicrometerMetrics` constructor so changing the defaults
is a matter of instantiating `MicrometerMetrics` with the desired values and tell `ConfigurationServiceProvider` about
it as shown above.
The micrometer implementation is typically created using one of the provided factory methods which, depending on which
is used, will return either a ready-to-use instance or a builder allowing users to customize how the implementation
Contributor

Shouldn't the builder pattern be consistent, so that we have only one entry point to the configuration? That would be `MicrometerMetrics.builder()`, self-documented with Javadoc. By default, the builder would not configure per-resource collection, and you could switch to the per-resource builder if needed.

MicrometerMetrics.newMicrometerMetricsBuilder(new LoggingMeterRegistry())
    .collectingMetricsPerResource(perResourceBuilder ->
        perResourceBuilder.withCleanUpDelayInSeconds(60))
    .build();

Collaborator Author

Having several entry points lets us provide often-used configurations without users having to create them with the builder. It also keeps API calls stable: the implementation of these factory methods can change if we change the default behavior, while the semantics of each method won't. So I'd be in favor of keeping things as they are for now.

behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect
metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior, but
it will change in a future version of JOSDK because it dramatically increases the cardinality of metrics, which
could lead to performance issues.
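
To make the cardinality concern concrete, here is a minimal, self-contained sketch. It does not use JOSDK or Micrometer: `CardinalitySketch`, `distinctSeries` and the tag values are hypothetical names chosen for illustration. It simply counts how many distinct tag combinations (that is, time series) a single meter ends up with, depending on whether a per-resource `name` tag is attached.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not JOSDK code: counts the distinct tag combinations
// produced for one meter, with and without a per-resource tag.
class CardinalitySketch {

  // Returns how many distinct tag sets are recorded for the given resource names.
  static int distinctSeries(List<String> resourceNames, boolean perResource) {
    Set<List<String>> series = new HashSet<>();
    for (String name : resourceNames) {
      // kind and scope are shared by every resource of a type; name is unique per resource
      List<String> tags = perResource
          ? List.of("kind=ConfigMap", "scope=namespace", "name=" + name)
          : List.of("kind=ConfigMap", "scope=namespace");
      series.add(tags);
    }
    return series.size();
  }

  public static void main(String[] args) {
    List<String> names = List.of("a", "b", "c");
    System.out.println(distinctSeries(names, true));  // prints 3: one series per resource
    System.out.println(distinctSeries(names, false)); // prints 1: one series per resource type
  }
}
```

With per-resource tags the number of series grows with the number of resources handled, whereas without them it stays bounded by the number of resource types.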

The micrometer implementation records the following metrics:
To create a `MicrometerMetrics` implementation that behaves how it has historically behaved, you can just create an
instance via:

```java
MeterRegistry registry = …;
// Deprecated constructor: keeps the historical, per-resource behavior
Metrics metrics = new MicrometerMetrics(registry);
```

Note, however, that this constructor is deprecated and we encourage you to use the factory methods instead, which either
return a fully pre-configured instance or a builder object that will allow you to configure more easily how the instance
will behave. You can, for example, configure whether or not the implementation should collect metrics on a per-resource
basis, whether or not associated meters should be removed when a resource is deleted and how the clean-up is performed.
See the relevant classes' documentation for more details.

| Meter name | Type | Tags | Description |
|-----------------------------------------------------------|----------------|------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| operator.sdk.reconciliations.executions.<reconciler name> | gauge | group, version, kind | Number of executions of the named reconciler |
| operator.sdk.reconciliations.queue.size.<reconciler name> | gauge | group, version, kind | How many resources are queued to get reconciled by named reconciler |
| operator.sdk.<map name>.size | gauge map size | | Gauge tracking the size of a specified map (currently unused but could be used to monitor caches size) |
| operator.sdk.events.received | counter | group, version, kind, name, namespace, scope, event, action | Number of received Kubernetes events |
| operator.sdk.events.delete | counter | group, version, kind, name, namespace, scope | Number of received Kubernetes delete events |
| operator.sdk.reconciliations.started | counter | group, version, kind, name, namespace, scope, reconciliations.retries.last, reconciliations.retries.number | Number of started reconciliations per resource type |
| operator.sdk.reconciliations.failed | counter | group, version, kind, name, namespace, scope, exception | Number of failed reconciliations per resource type |
| operator.sdk.reconciliations.success | counter | group, version, kind, name, namespace, scope | Number of successful reconciliations per resource type |
| operator.sdk.controllers.execution.reconcile.success | counter | controller, type | Number of successful reconciliations per controller |
| operator.sdk.controllers.execution.reconcile.failure | counter | controller, exception | Number of failed reconciliations per controller |
| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |

As you can see all the recorded metrics start with the `operator.sdk` prefix.
For example, the following will create a `MicrometerMetrics` instance configured to collect metrics on a per-resource
basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.

```java
MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
    .withCleanUpDelayInSeconds(5)
    .withCleaningThreadNumber(2)
    .build();
```
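
The delayed, asynchronous clean-up described above can be pictured with the following stand-alone sketch. It is not the JOSDK implementation: `DelayedCleanerSketch` and all of its members are hypothetical, and a plain map stands in for the meter registry. It only shows the general idea of keeping per-resource data around for a configurable delay after deletion and removing it on a small, dedicated thread pool.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the JOSDK implementation: per-resource counters are
// removed some time after the resource is deleted, on a dedicated thread pool.
class DelayedCleanerSketch {
  private final Map<String, Long> metersByResource = new ConcurrentHashMap<>();
  private final ScheduledExecutorService cleaner;
  private final long delayMillis;

  DelayedCleanerSketch(long delayMillis, int threads) {
    this.delayMillis = delayMillis;
    this.cleaner = Executors.newScheduledThreadPool(threads);
  }

  // Increments the counter associated with the given resource.
  void record(String resourceId) {
    metersByResource.merge(resourceId, 1L, Long::sum);
  }

  // Called when a resource is deleted: its data stays visible until the delay elapses.
  void onResourceDeleted(String resourceId) {
    cleaner.schedule(
        () -> { metersByResource.remove(resourceId); },
        delayMillis,
        TimeUnit.MILLISECONDS);
  }

  boolean hasMeters(String resourceId) {
    return metersByResource.containsKey(resourceId);
  }

  // Stops the clean-up threads; pending removals are discarded.
  void shutdown() {
    cleaner.shutdownNow();
  }
}
```

In this sketch, the delay gives a scraper a chance to read the final values before they disappear, and the thread count bounds how much of the process is devoted to clean-up work.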

The micrometer implementation records the following metrics:

| Meter name | Type | Tag names | Description |
|-----------------------------------------------------------|----------------|-----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| operator.sdk.reconciliations.executions.<reconciler name> | gauge | group, version, kind | Number of executions of the named reconciler |
| operator.sdk.reconciliations.queue.size.<reconciler name> | gauge | group, version, kind | How many resources are queued to get reconciled by named reconciler |
| operator.sdk.<map name>.size | gauge map size | | Gauge tracking the size of a specified map (currently unused but could be used to monitor caches size) |
| operator.sdk.events.received | counter | <resource metadata>, event, action | Number of received Kubernetes events |
| operator.sdk.events.delete | counter | <resource metadata> | Number of received Kubernetes delete events |
| operator.sdk.reconciliations.started | counter | <resource metadata>, reconciliations.retries.last, reconciliations.retries.number | Number of started reconciliations per resource type |
| operator.sdk.reconciliations.failed | counter | <resource metadata>, exception | Number of failed reconciliations per resource type |
| operator.sdk.reconciliations.success | counter | <resource metadata> | Number of successful reconciliations per resource type |
| operator.sdk.controllers.execution.reconcile | timer | <resource metadata>, controller | Time taken for reconciliations per controller |
| operator.sdk.controllers.execution.cleanup | timer | <resource metadata>, controller | Time taken for cleanups per controller |
| operator.sdk.controllers.execution.reconcile.success | counter | controller, type | Number of successful reconciliations per controller |
| operator.sdk.controllers.execution.reconcile.failure | counter | controller, exception | Number of failed reconciliations per controller |
| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |

As you can see, all the recorded metrics start with the `operator.sdk` prefix. `<resource metadata>`, in the table above,
refers to resource-specific metadata. Which tags it covers depends on the considered metric and on how the
implementation is configured. It can be summed up as follows: `group?, version, kind, [name, namespace?], scope`, where
the tags in square brackets (`[]`) won't be present when per-resource collection is disabled and tags followed by a
question mark are omitted if the associated value is empty. Of note, in the context of controllers' execution metrics,
these tag names are prefixed with `resource.`. This prefix might be removed in a future version for greater consistency.
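
The `group?, version, kind, [name, namespace?], scope` rule can be read as the small stand-alone sketch below. It is only an illustration of the rule as stated in this section, not code from JOSDK: `ResourceTagsSketch`, `tagNames` and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the tag-name rule described above; not JOSDK code.
class ResourceTagsSketch {

  // Returns the tag names emitted for a resource, following
  // `group?, version, kind, [name, namespace?], scope`.
  static List<String> tagNames(
      String group, String namespace, boolean perResource, boolean controllerExecution) {
    List<String> names = new ArrayList<>();
    if (group != null && !group.isEmpty()) {
      names.add("group"); // "group?": omitted when the value is empty
    }
    names.add("version");
    names.add("kind");
    if (perResource) { // "[...]": only present with per-resource collection
      names.add("name");
      if (namespace != null && !namespace.isEmpty()) {
        names.add("namespace"); // "namespace?": omitted when the value is empty
      }
    }
    names.add("scope");
    if (controllerExecution) {
      // controllers' execution metrics prefix these tag names with "resource."
      names.replaceAll(n -> "resource." + n);
    }
    return names;
  }

  public static void main(String[] args) {
    // Resource without a group, per-resource collection enabled:
    System.out.println(tagNames("", "default", true, false));
    // prints [version, kind, name, namespace, scope]

    // Resource with a group, per-resource collection disabled, controller execution metric:
    System.out.println(tagNames("apps", "default", false, true));
    // prints [resource.group, resource.version, resource.kind, resource.scope]
  }
}
```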

## Optimizing Caches
