-
Notifications
You must be signed in to change notification settings - Fork 56
WIP: Adding general-purpose metric query endpoints to nexus #415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Adds endpoint for getting a single timeseries schema by name - Adds endpoint for getting all timeseries schema - Adds endpoint for querying for timeseries - Moves the initialization of the ClickHouse database to a separate method, so consumers can make a client object without first requiring a connection to the DB. These are now part of the `DbWrite` trait, which provides some amount of separation between reading clients and the writable client that the oximeter collector has. - Update tests/config for new ClickHouse database client structure
04713c8
to
aec24f3
Compare
This branch is rough, but I'd like to get feedback before moving forward with it. This essentially exposes existing methods for querying timeseries as Nexus HTTP endpoints. The key knobs for filtering timeseries are the (target_name, metric_name), which is required, and zero or more filters on the fields and the time range. The schema for each timeseries can be returned from the |
#[derive(Deserialize, JsonSchema)] | ||
pub struct TimeseriesFilters { | ||
after: Option<DateTime<Utc>>, | ||
before: Option<DateTime<Utc>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer good old start
and end
if possible 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, sounds fine.
*/ | ||
#[endpoint { | ||
method = GET, | ||
path = "/metrics/timeseries/{target}/{metric}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is an example of a target
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RFDs 161 and 162 give a lot of background on this, but here's a brief overview for anyone:
A target is the thing we're monitoring or collecting data about: a software process, HTTP server, a sled, a DIMM, etc. The metric is the thing we're measuring about a target: number of requests to that HTTP server, the location of a sled in the rack, the temperature on a DIMM, etc.
Fields are used to identify a target and metric. They are specific attributes that distinguish one target from another of the same kind. A good example would be a name for an HTTP server, "nexus-internal"
vs "sled-agent"
, say.
A timeseries is a unique combination of a target and metric, i.e., a unique set of target fields and metric fields. It specifies a single set of timestamped samples which are all from the same measured feature (metric) of the same resource (target).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So a target would not be a particular server, it’s just all http servers, and then you use fields to narrow down to the particular one you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The struct itself, the Rust type, is a schema. An instance of that struct identifies one individual target. So the target in your example would be the actual server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so what you're calling target
here is really just the target name, and properly speaking the Target
is the combination of the target name and the fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, yes. We can (should?) make that more explicit, using {target,metric}_name
for these path components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would help.
It's early, but here's something to think about as you work on this: if Target
is a key concept, I would like to see it made explicit as early as possible in the call chain. Currently you pass structured-ish data (it's a tuple of function arguments, some of which are strings, but dates and filters are parsed) down 4 levels of function call, and it only gets fully structured at the bottom.
Request to "/metrics/timeseries/{target}/{metric}" with fields in body
-> nexus.get_timeseries(target, metric, after, before, field_filters)
# timeseries_name is target_name+metric_name
-> timeseries_db.filter_timeseries_with(×eries_name, filters, after, before)
-> filter_timeseries(filter) # filter is timeseries_name + filters
# split timeseries_key on : and calls Target { } with target_name
-> reconstitute_from_schema(timeseries_key, schema)
That means the interface between Nexus and Oximeter is a bit fuzzy, and to understand what the arguments mean you have to dig inside Oximeter and see what's being done with them. It would help me to see a Target
constructed right in the route handler and passed to nexus.get_timeseries
, or if it doesn't feel right to do that in the handler, maybe inside nexus.get_timeseries
before it gets passed to oximeter_db.filter_timeseries_with
. This also lets you avoid some of the argument processing you're doing inside oximeter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I'm happy to rename these path components to target_name
and metric_name
.
On your second point, I think there are a lot of pros and cons to resolving stringly-type field values to strongly-typed at different levels in the call stack. I appreciate you bringing it up, and I'll keep an eye out for these tradeoffs as I continue to build it out.
#[derive(Debug, Clone, Deserialize, Serialize, JsonSchema)] | ||
pub struct Filter { | ||
/// The name of the field. | ||
pub name: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this work and how does it relate to TimeFilter
? or is it early and I shouldn't worry about it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a filter on one field of a timeseries, as a string formatted as name=value
. It filters timeseries to those having a field with that name and value. The TimeFilter
is for restricting the returned samples to a time range.
The Filter
and FieldFilter
object are more closely related. The Filter
is built from strings, and the client internally looks those up against the schema of the desired timeseries, and tries to convert the string value into the typed data for that field.
The point of this type was originally to support the command-line tool oxdb
, where every argument is a string initially. That might not make sense here. We could use the FieldFilter
object directly, it just means that the construction and verification of the field values would fall to the front-end. As an example, the console would need to build a UUID type from a string that the user entered.
async fn metrics_get_timeseries( | ||
rqctx: Arc<RequestContext<Arc<ServerContext>>>, | ||
path_params: Path<TimeseriesName>, | ||
body: TypedBody<TimeseriesFilters>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a little unusual to use a request body with a GET. (It's hard for me to speak in generality but here's some comments about it.) Wouldn't it be more idiomatic to use query parameters here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I saw it, I assumed they were query params. So I agree, they should be query params. Hopefully Dropshot can handle nested objects in query params. Looks like Dropshot uses serde_urlencoded
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun fact, they're working on a QUERY
method that is basically a GET
with a body.
https://www.ietf.org/id/draft-ietf-httpbis-safe-method-w-body-02.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for that link @davepacheco, I've never been clear on some of these finer semantic details. I'll move both the name and filters into query parameters, thanks.
let apictx = rqctx.context(); | ||
let nexus = &apictx.nexus; | ||
let handler = async { | ||
let schema = nexus.get_all_timeseries_schema().await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this isn't paginated. I remember talking about this recently and I thought we were talking about an internal API. For the external API this seems like more of a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as you saw below, there's a TODO for this :)
Unfortunately, I think it's a big chunk of work. There's no way (that I can see) to paginate data from ClickHouse itself, which means we'd need to do the pagination in the oximeter_db
functions. I've not put a ton of thought into this, but it seems a tough because CH doesn't really have primary keys in the traditional sense. (You can have multiple records with the same primary key.)
This might be in the weeds, but I think the way to paginate would be to use the whole record (or nearly the whole thing), down to the timestamp. For reference, we'd use something like this tuple, which CH calls the sorting key. I'm really not sure how to put this into the queries, but it seems feasible if complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding right, that's basically what we're doing with Cockroach. We have a cursor that points to a particular row and ask for the next N records from that cursor. We have a concept of pagination params that encapsulates both the cursor and the sorting criteria in order to get stable results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@david-crespo Yeah, that is indeed how we paginate from Cockroach. I could be missing something, but it seems the difference is that in the case of Cockroach, the cursor that points to a particular row is the primary key, a single value that is unique by construction. ClickHouse just doesn't guarantee that kind of consistency, which means we might need to work harder to get a useful cursor. It might not even be possible, in the general case, since ClickHouse has no problem with records that have the same primary and sorting keys. You can insert the exact same record as many times as you want.
All of this is just to say, I don't know how to paginate directly from ClickHouse :)
EDIT: Actually, I don't think it's the primary key necessarily that we use as the cursor in CRDB, but it is a value that we have a unique index for, which has the same properties as the primary key: a uniqueness constraint and total ordering, allow us to construct a meaningful cursor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't use time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? The point is that there's nothing guaranteeing its uniqueness, though conflicts might be unlikely. I can certainly see situations where we have the multiple samples from a timeseries with the same timestamp, and samples from different timeseries with the same timestamp. Systems with low-resolution clocks in the former, and just dumb luck in the latter.
But time very well may be sufficiently unique for us, which I suggested in a comment above linking to the actual tuple we use for the sorting key. The timestamp is part of that, and so might be enough. I'll need to prototype to see :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, timestamps may not be unique. Looks like Clickhouse has UUIDs, which would work as a tiebreaker, but you're probably doubling the size of the records if you do that.
I'm going to close this PR unmerged. I'll be integrating it in chunks in later PRs, starting with #518. Paginating the timeseries data itself is more complicated, since dropshot doesn't support query parameter types that contain vectors. That work will come in follow-up PRs. |
method, so consumers can make a client object without first requiring
a connection to the DB. These are now part of the
DbWrite
trait,which provides some amount of separation between reading clients and
the writable client that the oximeter collector has.