Skip to content

Cache metric issues #1644

@darkowlzz

Description

@darkowlzz

This is to highlight the issue with the cache metrics and discuss how to
handle it with respect to the new generic typed cache.

At present, in source-controller we have a custom cache implementation for
storing the Helm repository index, refer
https://github.com/fluxcd/source-controller/tree/v1.4.1/internal/cache. This
cache can store any value in the form of an empty interface interface{}. For
instrumenting the cache, it exports a metric by the name
gotk_cache_events_total with labels event_type, name and namespace for
type of cache event (hit or miss), name of the Flux object for which the cache
is used and namespace of the object, respectively.

Following is how it appears when exported:

# HELP gotk_cache_events_total Total number of cache retrieval events for a Gitops Toolkit resource reconciliation.
# TYPE gotk_cache_events_total counter
gotk_cache_events_total{event_type="cache_hit",name="podinfo",namespace="default"} 3
gotk_cache_events_total{event_type="cache_miss",name="podinfo",namespace="default"} 1

If it is known that the only cache in source-controller is the Helm index cache,
this seems fine. But when we need to add other caches in source-controller for
other purposes, this metric gets confusing.

In https://github.com/fluxcd/pkg/tree/main/cache, a new generic typed cache is
implemented which will be used for different cache implementations across Flux.
See fluxcd/pkg#817 for the latest implementation of the
generic typed cache.

The following sections describe the issues with the existing metrics with
respect to the new generic typed cache.

Lack of object kind information

When a new cache instance (using the existing cache implementation), say auth
token cache, is added for GitRepository, OCIRepository, etc, due to lack of the
object kind in the metric label, we can't determine which object a metric
belongs to. Two objects of different kinds may have the same name and namespace.

The new generic typed cache also has a similar metric but it also includes the object
kind, which would help determine the associated object accurately.

Fixed metric name

The cache metric has hard-coded name, gotk_cache_events_total. If new cache
instance and metrics are added, they'll have the same name. The metric
registerer panics with the error duplicate metrics collector registration attempted when attempting to register two metrics with the same name.

This is true for the new generic typed cache too. The metric name is hard-coded.

In case of the old/existing cache implementation, the stored value is an empty
interface, which can store any value. This enables the usage of the same cache
for storing different items, although that may not be ideal, depending on the
use case. A single shared cache for everything would work fine.

But the new generic typed cache can't be shared because the cache instance are
created with their type and can only store values of specific type. This property of
the new cache requires creating separate caches for storing different types of data.
A cache for storing the helm index cache of type IndexFile. Another cache for
storing auth tokens of type string. The two with hard-coded metric names can't
be registered with the prometheus registerer.


Possible solutions

I believe ideally, the new cache would replace the existing cache in
source-controller. But for that we need to address the above issues. Using the
new generic typed cache will address the first issue as it includes the object kind
in the metric. But for addressing the fixed metric name issue, we may need to
allow dynamic/variable names.

The constructor of the cache metrics can accept a prefix which can be added
before the metric's fixed name. For example, if the fixed name is
cache_events_total, a prefix gotk_helm_index could be passed, resulting in
a new metric with name gotk_helm_index_cache_events_total. This would make is
easier to instrument individual caches and identify them easily. In case of the
helm index, all the metrics will belong to HelmRepository, but for auth token
cache, the same instance of cache may be shared amongst multiple reconcilers and
the object kind would help filter the metric for specific object kinds.

This would also benefit other users of our packages who may like to enable the
caching support. They can pass their own prefix, as gotk_cache wouldn't mean
anything to their users.

Concerns

  • Although we don't publish any monitoring dashboard that uses the existing
    cache metric, renaming the metric would break the existing monitoring setup of
    the users. For just helm index cache metrics, maybe we can find some way to
    continue exporting the metric with the same name, but we may eventually have to
    drop it.
  • For the auth token caches, should the cache be shared across the objects
    kinds/reconcilers? The considerations for this may not be based on the metrics,
    but the decision will affect the way metrics will be exported. If it's shared,
    the auth cache metric could be something like gotk_auth_cache_events_total.
    If it's not shared, it could be something like
    gotk_git_auth_cache_events_total, gotk_oci_auth_cache_events_total, etc.

The new generic typed cache implementation will be updated accordingly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions