WIP: Adding general-purpose metric query endpoints to nexus #415

bnaecker · 2021-11-22T20:16:51Z

Adds endpoint for getting a single timeseries schema by name
Adds endpoint for getting all timeseries schema
Adds endpoint for querying for timeseries
Moves the initialization of the ClickHouse database to a separate
method, so consumers can make a client object without first requiring
a connection to the DB. These are now part of the DbWrite trait,
which provides some amount of separation between reading clients and
the writable client that the oximeter collector has.
Update tests/config for new ClickHouse database client structure

- Adds endpoint for getting a single timeseries schema by name - Adds endpoint for getting all timeseries schema - Adds endpoint for querying for timeseries - Moves the initialization of the ClickHouse database to a separate method, so consumers can make a client object without first requiring a connection to the DB. These are now part of the `DbWrite` trait, which provides some amount of separation between reading clients and the writable client that the oximeter collector has. - Update tests/config for new ClickHouse database client structure

bnaecker · 2021-11-22T20:24:00Z

This branch is rough, but I'd like to get feedback before moving forward with it. This essentially exposes existing methods for querying timeseries as Nexus HTTP endpoints. The key knobs for filtering timeseries are the (target_name, metric_name), which is required, and zero or more filters on the fields and the time range. The schema for each timeseries can be returned from the /metrics/schema endpoint, so that the caller knows which fields to specify.

david-crespo · 2021-11-22T20:50:05Z

nexus/src/external_api/http_entrypoints.rs

+#[derive(Deserialize, JsonSchema)]
+pub struct TimeseriesFilters {
+    after: Option<DateTime<Utc>>,
+    before: Option<DateTime<Utc>>,


I prefer good old start and end if possible 😉

Sure, sounds fine.

nexus/src/config.rs

david-crespo · 2021-11-22T23:09:27Z

nexus/src/external_api/http_entrypoints.rs

+ */
+#[endpoint {
+    method = GET,
+    path = "/metrics/timeseries/{target}/{metric}",


what is an example of a target?

RFDs 161 and 162 give a lot of background on this, but here's a brief overview for anyone:

A target is the thing we're monitoring or collecting data about: a software process, HTTP server, a sled, a DIMM, etc. The metric is the thing we're measuring about a target: number of requests to that HTTP server, the location of a sled in the rack, the temperature on a DIMM, etc.

Fields are used to identify a target and metric. They are specific attributes that distinguish one target from another of the same kind. A good example would be a name for an HTTP server, "nexus-internal" vs "sled-agent", say.

A timeseries is a unique combination of a target and metric, i.e., a unique set of target fields and metric fields. It specifies a single set of timestamped samples which are all from the same measured feature (metric) of the same resource (target).

So a target would not be a particular server, it’s just all http servers, and then you use fields to narrow down to the particular one you want?

The struct itself, the Rust type, is a schema. An instance of that struct identifies one individual target. So the target in your example would be the actual server.

I see, so what you're calling target here is really just the target name, and properly speaking the Target is the combination of the target name and the fields?

That's fair, yes. We can (should?) make that more explicit, using {target,metric}_name for these path components.

That would help.

It's early, but here's something to think about as you work on this: if Target is a key concept, I would like to see it made explicit as early as possible in the call chain. Currently you pass structured-ish data (it's a tuple of function arguments, some of which are strings, but dates and filters are parsed) down 4 levels of function call, and it only gets fully structured at the bottom.

Request to "/metrics/timeseries/{target}/{metric}" with fields in body -> nexus.get_timeseries(target, metric, after, before, field_filters) # timeseries_name is target_name+metric_name -> timeseries_db.filter_timeseries_with(&timeseries_name, filters, after, before) -> filter_timeseries(filter) # filter is timeseries_name + filters # split timeseries_key on : and calls Target { } with target_name -> reconstitute_from_schema(timeseries_key, schema)

That means the interface between Nexus and Oximeter is a bit fuzzy, and to understand what the arguments mean you have to dig inside Oximeter and see what's being done with them. It would help me to see a Target constructed right in the route handler and passed to nexus.get_timeseries, or if it doesn't feel right to do that in the handler, maybe inside nexus.get_timeseries before it gets passed to oximeter_db.filter_timeseries_with. This also lets you avoid some of the argument processing you're doing inside oximeter.

Cool, I'm happy to rename these path components to target_name and metric_name.

On your second point, I think there are a lot of pros and cons to resolving stringly-type field values to strongly-typed at different levels in the call stack. I appreciate you bringing it up, and I'll keep an eye out for these tradeoffs as I continue to build it out.

david-crespo · 2021-11-22T23:10:36Z

oximeter/db/src/query.rs

+#[derive(Debug, Clone, Deserialize, Serialize, JsonSchema)]
 pub struct Filter {
    /// The name of the field.
    pub name: String,


how does this work and how does it relate to TimeFilter? or is it early and I shouldn't worry about it

This is a filter on one field of a timeseries, as a string formatted as name=value. It filters timeseries to those having a field with that name and value. The TimeFilter is for restricting the returned samples to a time range.

The Filter and FieldFilter object are more closely related. The Filter is built from strings, and the client internally looks those up against the schema of the desired timeseries, and tries to convert the string value into the typed data for that field.

The point of this type was originally to support the command-line tool oxdb, where every argument is a string initially. That might not make sense here. We could use the FieldFilter object directly, it just means that the construction and verification of the field values would fall to the front-end. As an example, the console would need to build a UUID type from a string that the user entered.

nexus/src/external_api/http_entrypoints.rs

davepacheco · 2021-11-24T16:48:42Z

nexus/src/external_api/http_entrypoints.rs

+async fn metrics_get_timeseries(
+    rqctx: Arc<RequestContext<Arc<ServerContext>>>,
+    path_params: Path<TimeseriesName>,
+    body: TypedBody<TimeseriesFilters>,


I think it's a little unusual to use a request body with a GET. (It's hard for me to speak in generality but here's some comments about it.) Wouldn't it be more idiomatic to use query parameters here?

When I saw it, I assumed they were query params. So I agree, they should be query params. Hopefully Dropshot can handle nested objects in query params. Looks like Dropshot uses serde_urlencoded.

Fun fact, they're working on a QUERY method that is basically a GET with a body.

https://www.ietf.org/id/draft-ietf-httpbis-safe-method-w-body-02.html

Thanks for that link @davepacheco, I've never been clear on some of these finer semantic details. I'll move both the name and filters into query parameters, thanks.

davepacheco · 2021-11-24T16:49:44Z

nexus/src/external_api/http_entrypoints.rs

+    let apictx = rqctx.context();
+    let nexus = &apictx.nexus;
+    let handler = async {
+        let schema = nexus.get_all_timeseries_schema().await?;


It looks like this isn't paginated. I remember talking about this recently and I thought we were talking about an internal API. For the external API this seems like more of a problem.

Yeah, as you saw below, there's a TODO for this :)

Unfortunately, I think it's a big chunk of work. There's no way (that I can see) to paginate data from ClickHouse itself, which means we'd need to do the pagination in the oximeter_db functions. I've not put a ton of thought into this, but it seems a tough because CH doesn't really have primary keys in the traditional sense. (You can have multiple records with the same primary key.)

This might be in the weeds, but I think the way to paginate would be to use the whole record (or nearly the whole thing), down to the timestamp. For reference, we'd use something like this tuple, which CH calls the sorting key. I'm really not sure how to put this into the queries, but it seems feasible if complicated.

If I'm understanding right, that's basically what we're doing with Cockroach. We have a cursor that points to a particular row and ask for the next N records from that cursor. We have a concept of pagination params that encapsulates both the cursor and the sorting criteria in order to get stable results.

@david-crespo Yeah, that is indeed how we paginate from Cockroach. I could be missing something, but it seems the difference is that in the case of Cockroach, the cursor that points to a particular row is the primary key, a single value that is unique by construction. ClickHouse just doesn't guarantee that kind of consistency, which means we might need to work harder to get a useful cursor. It might not even be possible, in the general case, since ClickHouse has no problem with records that have the same primary and sorting keys. You can insert the exact same record as many times as you want.

All of this is just to say, I don't know how to paginate directly from ClickHouse :)

EDIT: Actually, I don't think it's the primary key necessarily that we use as the cursor in CRDB, but it is a value that we have a unique index for, which has the same properties as the primary key: a uniqueness constraint and total ordering, allow us to construct a meaningful cursor.

You can't use time?

Maybe? The point is that there's nothing guaranteeing its uniqueness, though conflicts might be unlikely. I can certainly see situations where we have the multiple samples from a timeseries with the same timestamp, and samples from different timeseries with the same timestamp. Systems with low-resolution clocks in the former, and just dumb luck in the latter.

But time very well may be sufficiently unique for us, which I suggested in a comment above linking to the actual tuple we use for the sorting key. The timestamp is part of that, and so might be enough. I'll need to prototype to see :)

Got it, timestamps may not be unique. Looks like Clickhouse has UUIDs, which would work as a tiebreaker, but you're probably doubling the size of the records if you do that.

nexus/src/nexus.rs

bnaecker · 2021-12-15T19:03:31Z

I'm going to close this PR unmerged. I'll be integrating it in chunks in later PRs, starting with #518. Paginating the timeseries data itself is more complicated, since dropshot doesn't support query parameter types that contain vectors. That work will come in follow-up PRs.

bnaecker marked this pull request as draft November 22, 2021 20:16

bnaecker force-pushed the nexus/metrics-endpoints branch from 04713c8 to aec24f3 Compare November 22, 2021 20:17

bnaecker requested review from davepacheco, david-crespo and zephraph November 22, 2021 20:17

david-crespo reviewed Nov 22, 2021

View reviewed changes

zephraph reviewed Nov 22, 2021

View reviewed changes

nexus/src/config.rs Show resolved Hide resolved

david-crespo reviewed Nov 22, 2021

View reviewed changes

zephraph reviewed Nov 23, 2021

View reviewed changes

nexus/src/external_api/http_entrypoints.rs Show resolved Hide resolved

davepacheco reviewed Nov 24, 2021

View reviewed changes

davepacheco mentioned this pull request Dec 1, 2021

Mac test timeout #461

Closed

bnaecker closed this Dec 15, 2021

bnaecker deleted the nexus/metrics-endpoints branch July 23, 2022 03:30

WIP: Adding general-purpose metric query endpoints to nexus #415

WIP: Adding general-purpose metric query endpoints to nexus #415

Uh oh!

Conversation

bnaecker commented Nov 22, 2021

Uh oh!

bnaecker commented Nov 22, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bnaecker Nov 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

david-crespo Nov 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bnaecker Nov 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

david-crespo Nov 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bnaecker commented Dec 15, 2021

Uh oh!

Uh oh!

bnaecker Nov 22, 2021 •

edited

Loading

david-crespo Nov 24, 2021 •

edited

Loading

bnaecker Nov 24, 2021 •

edited

Loading

david-crespo Nov 24, 2021 •

edited

Loading