
Conversation

Contributor

@pstibrany pstibrany commented Sep 1, 2020

What this PR does: This PR implements shuffle-sharding of queriers in the query-frontend.

Since each query-frontend knows about all queriers connected to it, it can select N queriers that will handle requests from a specific user.

Implementation details:

  • The query-frontend needs to know which queriers are connected, and to group multiple connections from the same querier. The protocol between the query-frontend and the querier has been extended to pass a querier ID. Old queriers will use an empty string as their ID. If a new querier connects to an old query-frontend, it will not send its ID at all.

  • If the same set of queriers is connected to all frontends, these frontends will all select the same subset of queriers for a given user (see the sketch after this list).

  • Each user has a different subset of queriers.

  • When searching for the next user queue to handle a request from, each querier fairly iterates over its set of users in round-robin fashion, as before.

  • Shuffle-sharding is enabled by setting either the default limit (-frontend.max-queriers-per-user) or a per-user limit ("max_queriers_per_user").

  • integration test

  • testing in dev cluster

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pstibrany pstibrany changed the title from "Shuffle-sharding of queriers" to "Shuffle-sharding of queriers in the query-frontend" on Sep 1, 2020
@pstibrany pstibrany requested a review from pracucci September 1, 2020 08:43
Contributor

this could roll the same querier multiple times?

Contributor Author
@pstibrany pstibrany Sep 1, 2020

Yes, it can. An alternative would be to make a copy of the queriers, shuffle them, and use the first maxQueriers of them. The current version avoids that extra allocation at the cost of extra iterations. Not sure which one is better (time for a benchmark?).

Contributor Author

I've tried using shuffling instead of drawing random numbers. The updated benchmark shows slightly less time needed for BenchmarkQueueRequest (well, with ±7% noise, maybe not), but also slightly more memory allocations.

name              old time/op    new time/op    delta
GetNextRequest-4    59.7µs ± 4%    59.7µs ± 1%    ~     (p=0.645 n=10+9)
QueueRequest-4       626µs ± 7%     609µs ± 2%    ~     (p=0.105 n=10+10)

name              old alloc/op   new alloc/op   delta
GetNextRequest-4    1.60kB ± 0%    1.60kB ± 0%    ~     (all equal)
QueueRequest-4       322kB ± 0%     326kB ± 0%  +1.24%  (p=0.000 n=10+10)

name              old allocs/op  new allocs/op  delta
GetNextRequest-4       100 ± 0%       100 ± 0%    ~     (all equal)
QueueRequest-4       1.07k ± 0%     1.12k ± 0%  +4.68%  (p=0.000 n=10+10)
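
(For reference, comparisons like the one above are typically produced with benchstat; a hypothetical invocation, with the package path as an assumption:)

```sh
go test -run=NONE -bench='GetNextRequest|QueueRequest' -benchmem -count=10 ./pkg/querier/... > old.txt
# apply the change, write new.txt the same way, then:
benchstat old.txt new.txt
```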

Contributor

I'm not sure if I understand this logic. Don't we want to shuffle-shard each user onto n distinct queriers? Does this function guarantee n distinct queriers?

Contributor Author

"Distinctness" is guaranteed by 1) using a map, 2) condition in the for loop. This loop doesn't end until there are maxQueriers distinct queriers in the map. In addition to that, thanks to len(allSortedQueriers) > maxQueriers precondition checked at the beginning of this method, we know that loop will eventually finish.

Contributor

Doing this via repeated random selections versus shuffling the array seems like a probably unnecessary optimization, at the expense of having a more deterministic outcome in terms of iterations required. I'm not sure it will matter practically, but predictable seems better, meaning: shuffle a copy of the array.

You don't have to shuffle the whole array: you can make a copy, then for each counter index cnt over the first N indices, generate a random index between cnt+1 and N-1 inclusive and swap the value at the counter index with the value at the random index. That way you can never select a duplicate, and you don't have to swap more values than the number you need to select. This is basically https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm, except I described it from 0 up and they describe it from N-1 down.

Contributor Author

I've changed the implementation to use the suggested algorithm. (Btw, it's the same as what rand.Shuffle() does, but in our case we can stop early, and we can set elements in the returned map without another iteration.)
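
In sketch form (hypothetical names, assumes math/rand; this unbiased variant draws j from [i, len), so an element may stay in place, and the caller ensures maxQueriers <= len(queriers)):

```go
// pickQueriersShuffle runs a partial Fisher–Yates shuffle: like
// rand.Shuffle(), but it stops after maxQueriers swaps and fills the
// result map as it goes, so no second iteration is needed.
func pickQueriersShuffle(rnd *rand.Rand, queriers []string, maxQueriers int) map[string]struct{} {
	scratch := append([]string(nil), queriers...) // copy; don't mutate the input
	picked := make(map[string]struct{}, maxQueriers)
	for i := 0; i < maxQueriers; i++ {
		j := i + rnd.Intn(len(scratch)-i) // random index in [i, len)
		scratch[i], scratch[j] = scratch[j], scratch[i]
		picked[scratch[i]] = struct{}{}
	}
	return picked
}
```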

Contributor

I'm not sure how to quantify the effect, but this could unfairly skip users who don't happen to be assigned to the querier currently asking for a query.

Contributor Author

I don’t quite follow here... skipping users who use different queriers is the goal of this PR.

# are connected to all frontends). Note that this only works with queriers
# connecting to the query-frontend, not when using downstream URL.
# CLI flag: -frontend.max-queriers-per-user
[max_queriers_per_user: <int> | default = 0]
Contributor

In the PR for store-gateway shuffle sharding, a config called sharding_strategy was introduced. Is that something to be added here as well?

Contributor Author

Store-gateways can use different sharding strategies. In the case of queriers, no sharding was done before (which corresponds to max_queriers_per_user: 0), and only shuffle-sharding is available (when max_queriers_per_user is greater than 0). We could introduce a sharding_strategy option too, but personally I don't see a need for it here.
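
To make the semantics concrete, a hedged config sketch (assuming the standard Cortex limits block and runtime overrides file layout; the tenant name is hypothetical):

```yaml
# Global default (CLI flag: -frontend.max-queriers-per-user):
limits:
  max_queriers_per_user: 10   # 0 (the default) disables shuffle-sharding

# Per-tenant override in the runtime overrides file:
overrides:
  tenant-a:
    max_queriers_per_user: 5
```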

Contributor
@pracucci pracucci Sep 16, 2020

Food for thought (not having a strong opinion): I agree it's not required here, but could help with config consistency.

Contributor Author

Food for thought (not having a strong opinion): I agree it's not required here, but could help with config consistency.

If we want to be consistent with other config options, we should also support a "default" (non-shuffle-sharding) strategy with a non-zero shard size.

Contributor

Let's keep it as is for now. We marked this feature as experimental, which will allow us to eventually fine-tune the config before marking it stable.

Contributor
@ranton256 ranton256 left a comment

Changes look good to me. Thanks for the array shuffling algorithm change.

Contributor
@pracucci pracucci left a comment

Fantastic job 👏 I also really appreciated the tests! I left minor comments, but overall LGTM!

I would mention this is experimental in the "v1 guarantees" doc (but not add the experimental CLI flag prefix, which we've already seen is a pain) in order to be able to make any breaking change until we've got enough confidence running it in production.

I think we also mentioned we want aggressive gRPC keepalive settings for querier->query-frontend, in order to quickly detect "dead" queriers. Could you check the current settings (if any) and fine-tune them if needed, please?

In a separate PR, I would also work on some doc.

Contributor

getOrAddQueue() can potentially return nil. I would check it here.

Contributor Author

I'll add a check, with a comment that it can only happen if user is "".
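
Something along these lines (a sketch, not the exact diff; the error handling is illustrative):

```go
queue := getOrAddQueue(userID, maxQueriers)
if queue == nil {
	// This can only happen if userID is "" (invalid).
	return fmt.Errorf("no queue found for user %q", userID)
}
```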

Commits (signed off by Peter Štibraný <[email protected]>):

  • Modified worker to respond to new GET_ID request type.
  • This is similar to rnd.Shuffle(), but stops early after selecting enough queriers.
  • …from supplied input.
@pstibrany
Contributor Author

I would mention this is experimental in the "v1 guarantees" doc (but not add the experimental CLI flag prefix, which we've already seen is a pain) in order to be able to make any breaking change until we've got enough confidence running it in production.

I think we also mentioned we want aggressive gRPC keepalive settings for querier->query-frontend, in order to quickly detect "dead" queriers. Could you check the current settings (if any) and fine-tune them if needed, please?

We need to configure these settings on the query-frontend. The default values that Cortex uses are:

  • -server.grpc.keepalive.time=2h
  • -server.grpc.keepalive.timeout=20s

which means that after 2 hours of no activity on the connection, the server pings the client and waits 20 seconds for a response. If there is no reply, the connection is closed. We can include this information in the docs so people can tune it (e.g. time=1m / timeout=20s, or whatever makes sense for their setup).
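
For example, a more aggressive configuration on the query-frontend might look like this (illustrative values, taken from the suggestion above):

```
-server.grpc.keepalive.time=1m
-server.grpc.keepalive.timeout=20s
```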

In a separate PR, I would also work on some doc.

👍

@pstibrany
Contributor Author

@pracucci I've addressed your feedback, please take a look again when time permits. Thanks!

@pracucci
Contributor

I've addressed your feedback, please take a look again when time permits. Thanks!

LGTM. Thanks to you! 🙏

@pstibrany pstibrany merged commit b1ee0aa into cortexproject:master Sep 17, 2020