Skip to content

Conversation

bnaecker
Copy link
Collaborator

  • Adds endpoint for getting a single timeseries schema by name
  • Adds endpoint for getting all timeseries schema
  • Adds endpoint for querying for timeseries
  • Moves the initialization of the ClickHouse database to a separate
    method, so consumers can make a client object without first requiring
    a connection to the DB. These are now part of the DbWrite trait,
    which provides some amount of separation between reading clients and
    the writable client that the oximeter collector has.
  • Update tests/config for new ClickHouse database client structure

@bnaecker bnaecker marked this pull request as draft November 22, 2021 20:16
- Adds endpoint for getting a single timeseries schema by name
- Adds endpoint for getting all timeseries schema
- Adds endpoint for querying for timeseries
- Moves the initialization of the ClickHouse database to a separate
  method, so consumers can make a client object without first requiring
  a connection to the DB. These are now part of the `DbWrite` trait,
  which provides some amount of separation between reading clients and
  the writable client that the oximeter collector has.
- Update tests/config for new ClickHouse database client structure
@bnaecker
Copy link
Collaborator Author

This branch is rough, but I'd like to get feedback before moving forward with it. This essentially exposes existing methods for querying timeseries as Nexus HTTP endpoints. The key knobs for filtering timeseries are the (target_name, metric_name), which is required, and zero or more filters on the fields and the time range. The schema for each timeseries can be returned from the /metrics/schema endpoint, so that the caller knows which fields to specify.

#[derive(Deserialize, JsonSchema)]
pub struct TimeseriesFilters {
after: Option<DateTime<Utc>>,
before: Option<DateTime<Utc>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer good old start and end if possible 😉

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sounds fine.

*/
#[endpoint {
method = GET,
path = "/metrics/timeseries/{target}/{metric}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is an example of a target?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFDs 161 and 162 give a lot of background on this, but here's a brief overview for anyone:

A target is the thing we're monitoring or collecting data about: a software process, HTTP server, a sled, a DIMM, etc. The metric is the thing we're measuring about a target: number of requests to that HTTP server, the location of a sled in the rack, the temperature on a DIMM, etc.

Fields are used to identify a target and metric. They are specific attributes that distinguish one target from another of the same kind. A good example would be a name for an HTTP server, "nexus-internal" vs "sled-agent", say.

A timeseries is a unique combination of a target and metric, i.e., a unique set of target fields and metric fields. It specifies a single set of timestamped samples which are all from the same measured feature (metric) of the same resource (target).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a target would not be a particular server, it’s just all http servers, and then you use fields to narrow down to the particular one you want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct itself, the Rust type, is a schema. An instance of that struct identifies one individual target. So the target in your example would be the actual server.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so what you're calling target here is really just the target name, and properly speaking the Target is the combination of the target name and the fields?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, yes. We can (should?) make that more explicit, using {target,metric}_name for these path components.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would help.

It's early, but here's something to think about as you work on this: if Target is a key concept, I would like to see it made explicit as early as possible in the call chain. Currently you pass structured-ish data (it's a tuple of function arguments, some of which are strings, but dates and filters are parsed) down 4 levels of function call, and it only gets fully structured at the bottom.

Request to "/metrics/timeseries/{target}/{metric}" with fields in body

-> nexus.get_timeseries(target, metric, after, before, field_filters)

# timeseries_name is target_name+metric_name
-> timeseries_db.filter_timeseries_with(&timeseries_name, filters, after, before)

-> filter_timeseries(filter) # filter is timeseries_name + filters

# split timeseries_key on : and calls Target { } with target_name
-> reconstitute_from_schema(timeseries_key, schema)

That means the interface between Nexus and Oximeter is a bit fuzzy, and to understand what the arguments mean you have to dig inside Oximeter and see what's being done with them. It would help me to see a Target constructed right in the route handler and passed to nexus.get_timeseries, or if it doesn't feel right to do that in the handler, maybe inside nexus.get_timeseries before it gets passed to oximeter_db.filter_timeseries_with. This also lets you avoid some of the argument processing you're doing inside oximeter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I'm happy to rename these path components to target_name and metric_name.

On your second point, I think there are a lot of pros and cons to resolving stringly-type field values to strongly-typed at different levels in the call stack. I appreciate you bringing it up, and I'll keep an eye out for these tradeoffs as I continue to build it out.

#[derive(Debug, Clone, Deserialize, Serialize, JsonSchema)]
pub struct Filter {
/// The name of the field.
pub name: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this work and how does it relate to TimeFilter? or is it early and I shouldn't worry about it

Copy link
Collaborator Author

@bnaecker bnaecker Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a filter on one field of a timeseries, as a string formatted as name=value. It filters timeseries to those having a field with that name and value. The TimeFilter is for restricting the returned samples to a time range.

The Filter and FieldFilter object are more closely related. The Filter is built from strings, and the client internally looks those up against the schema of the desired timeseries, and tries to convert the string value into the typed data for that field.

The point of this type was originally to support the command-line tool oxdb, where every argument is a string initially. That might not make sense here. We could use the FieldFilter object directly, it just means that the construction and verification of the field values would fall to the front-end. As an example, the console would need to build a UUID type from a string that the user entered.

async fn metrics_get_timeseries(
rqctx: Arc<RequestContext<Arc<ServerContext>>>,
path_params: Path<TimeseriesName>,
body: TypedBody<TimeseriesFilters>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a little unusual to use a request body with a GET. (It's hard for me to speak in generality but here's some comments about it.) Wouldn't it be more idiomatic to use query parameters here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I saw it, I assumed they were query params. So I agree, they should be query params. Hopefully Dropshot can handle nested objects in query params. Looks like Dropshot uses serde_urlencoded.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun fact, they're working on a QUERY method that is basically a GET with a body.

https://www.ietf.org/id/draft-ietf-httpbis-safe-method-w-body-02.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that link @davepacheco, I've never been clear on some of these finer semantic details. I'll move both the name and filters into query parameters, thanks.

let apictx = rqctx.context();
let nexus = &apictx.nexus;
let handler = async {
let schema = nexus.get_all_timeseries_schema().await?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this isn't paginated. I remember talking about this recently and I thought we were talking about an internal API. For the external API this seems like more of a problem.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, as you saw below, there's a TODO for this :)

Unfortunately, I think it's a big chunk of work. There's no way (that I can see) to paginate data from ClickHouse itself, which means we'd need to do the pagination in the oximeter_db functions. I've not put a ton of thought into this, but it seems a tough because CH doesn't really have primary keys in the traditional sense. (You can have multiple records with the same primary key.)

This might be in the weeds, but I think the way to paginate would be to use the whole record (or nearly the whole thing), down to the timestamp. For reference, we'd use something like this tuple, which CH calls the sorting key. I'm really not sure how to put this into the queries, but it seems feasible if complicated.

Copy link
Contributor

@david-crespo david-crespo Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding right, that's basically what we're doing with Cockroach. We have a cursor that points to a particular row and ask for the next N records from that cursor. We have a concept of pagination params that encapsulates both the cursor and the sorting criteria in order to get stable results.

Copy link
Collaborator Author

@bnaecker bnaecker Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david-crespo Yeah, that is indeed how we paginate from Cockroach. I could be missing something, but it seems the difference is that in the case of Cockroach, the cursor that points to a particular row is the primary key, a single value that is unique by construction. ClickHouse just doesn't guarantee that kind of consistency, which means we might need to work harder to get a useful cursor. It might not even be possible, in the general case, since ClickHouse has no problem with records that have the same primary and sorting keys. You can insert the exact same record as many times as you want.

All of this is just to say, I don't know how to paginate directly from ClickHouse :)

EDIT: Actually, I don't think it's the primary key necessarily that we use as the cursor in CRDB, but it is a value that we have a unique index for, which has the same properties as the primary key: a uniqueness constraint and total ordering, allow us to construct a meaningful cursor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't use time?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? The point is that there's nothing guaranteeing its uniqueness, though conflicts might be unlikely. I can certainly see situations where we have the multiple samples from a timeseries with the same timestamp, and samples from different timeseries with the same timestamp. Systems with low-resolution clocks in the former, and just dumb luck in the latter.

But time very well may be sufficiently unique for us, which I suggested in a comment above linking to the actual tuple we use for the sorting key. The timestamp is part of that, and so might be enough. I'll need to prototype to see :)

Copy link
Contributor

@david-crespo david-crespo Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, timestamps may not be unique. Looks like Clickhouse has UUIDs, which would work as a tiebreaker, but you're probably doubling the size of the records if you do that.

@davepacheco davepacheco mentioned this pull request Dec 1, 2021
@bnaecker
Copy link
Collaborator Author

I'm going to close this PR unmerged. I'll be integrating it in chunks in later PRs, starting with #518. Paginating the timeseries data itself is more complicated, since dropshot doesn't support query parameter types that contain vectors. That work will come in follow-up PRs.

@bnaecker bnaecker closed this Dec 15, 2021
@bnaecker bnaecker deleted the nexus/metrics-endpoints branch July 23, 2022 03:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants